Thread priority - is there any option?

lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

FNOPs

It was done with the -fschedule option (when scheduling is enabled, the assembler may reorder instructions to minimize the number of FNOPs), plus writing the code in such a way that it compiles to something like handwritten ASM.
Instead of writing the following (using != 0 as the loop test should let the compiler skip the comparison instruction)

#pragma unsafe arrays
		 for(int k=len;k!=0;k--)
		 {hi,lo}=macs(h[k],x[k+state],hi,lo);
I wrote

		do {
			int d, e;
#pragma loop unroll
			for (int s = 0; s < 4; s++) {
				d = h[i];
				i--;
				e = x[ds];
				ds = i + state; // or use ds--;
				{	hi,lo}=macs(d,e,hi,lo);
			}
		} while (i != 0); // Skip
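
For readers without the XMOS tools, the two loop shapes can be compared in plain C. This is only a sketch under assumptions: `macs64` is a hypothetical stand-in for the XC `macs()` builtin (it folds the {hi, lo} pair into one 64-bit accumulator), and the start values of `i` and `ds` are guessed, since the post does not show them. It only checks that both shapes compute the same sum; it says nothing about the generated instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the XC macs() builtin: acc += a * b in 64 bits. */
static int64_t macs64(int32_t a, int32_t b, int64_t acc) {
    return acc + (int64_t)a * (int64_t)b;
}

/* One tap per pass, counting down to zero (uses h[len] .. h[1]). */
static int64_t fir_simple(const int32_t *h, const int32_t *x,
                          size_t state, size_t len) {
    int64_t acc = 0;
    for (size_t k = len; k != 0; k--)
        acc = macs64(h[k], x[k + state], acc);
    return acc;
}

/* Four taps per pass, with each index update placed between the loads. */
static int64_t fir_unrolled4(const int32_t *h, const int32_t *x,
                             size_t state, size_t len) {
    assert(len != 0 && len % 4 == 0);   /* assumed precondition */
    int64_t acc = 0;
    size_t i = len;                     /* assumed start value */
    size_t ds = i + state;              /* assumed start value */
    do {
        for (int s = 0; s < 4; s++) {
            int32_t d = h[i];
            i--;
            int32_t e = x[ds];
            ds = i + state;
            acc = macs64(d, e, acc);
        }
    } while (i != 0);
    return acc;
}
```

Both functions read h[len] down to h[1], so h needs len + 1 elements and x needs len + state + 1.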
Please review my understanding here, it might be wrong:


ldw - 16 bit, read SRAM
subi - 16 bit, fetch ldw, add
ldw - 16 bit, read SRAM
add - 16 bit, fetch maccs
maccs - 32 bit, fetch branch - makes no speculative operations
branch if not zero - 16 bit, fetch ldw, subi

The maccs gets aligned in the 64-bit instruction buffer, and the instruction buffer never becomes empty in the loop after a branch. A new instruction fetch can only be done when no read/write operation is performed.

The standard "for loop" compilation results in:
ldw
ldw
FNOP
maccs
...

since the instruction buffer becomes empty after 2 ldw in a row after the branch.

Pipeline:
1 decode reg-write
2 reg-read
3 address ALU1 resource-test
4 read/write/fetch ALU2 resource-access schedule


XMOS, an idea: -O3 or -fschedule should recognize a MAC in a for loop and reorder the instructions in a similar way.

With s=4 (four taps per loop pass), the computational burden becomes 5.25N+M cycles for one thread (N being the number of FIR filter taps), which is very good for a GP microcomputer.
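
One way to read the 5.25N+M figure is sketched below. The assumptions are mine and not measured: 5 issue slots per tap (ldw, subi, ldw, add, maccs, as in the instruction list above), one branch per unrolled pass, and M taken to be a fixed overhead outside the loop. `fir_cycles` is a hypothetical helper name.

```c
#include <assert.h>

/* Estimated issue slots for an n_taps FIR unrolled `unroll` times.
   Assumes 5 slots per tap, 1 branch per pass, fixed overhead overhead_m. */
static unsigned fir_cycles(unsigned n_taps, unsigned unroll,
                           unsigned overhead_m) {
    assert(unroll != 0 && n_taps % unroll == 0);
    return 5u * n_taps + n_taps / unroll + overhead_m;
}
```

With unroll = 4 this gives 5N + N/4 = 5.25N plus the overhead; without unrolling it would be 6N under the same assumptions.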


Probably not the most confused programmer anymore on the XCORE forum.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Ah, fetch noop, I didn't get that from context.

Is the "s" loop not fully unrolled? It loops a fixed four times and the loop body
is very short, so the compiler should do that.

Then, it gets to hoist the additions out of the loop, so you get two LDWs and
a MAC only, per iteration of the inner loop. That gives a bubble, but you should
be able to put the bookkeeping insns from the outer loop in there.

That would make 4 cycles per inner loop iteration. Doing better requires quite
more massive surgery I suppose ;-)
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

Issued FNOPs can be viewed in XTA or in a simulation.

segher wrote: "Then, it gets to hoist the additions out of the loop, so you get two LDWs and a MAC only"

I do not see how to move the array index update outside the loop,

LDW
LDW
MAC
LDW
LDW
MAC

cannot run without FNOPs, since only the MAC performs a fetch. You have to do a fetch after each LDW,
so it is best to use the fetching instructions for something useful, like the index update:

1 LDW no fetch
2 sub|add & fetch(3,4)
3 LDW no fetch
4 sub|add & fetch(5)
5 MAC&fetch(1,2)
1 LDW no fetch
2 sub|add & fetch(3,4)
3 LDW no fetch
4 sub|add & fetch(5)
5 MAC (includes fetch)

(The example includes double buffering of the data outside the loop)
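
The fetch argument above can be played through with a toy model. Everything in it is an assumption taken from this thread rather than from the architecture documentation: a buffer of four 16-bit halfwords, 32-bit fetches, one fetch per non-memory instruction, an FNOP whenever the next instruction is not fully buffered, and one word already fetched at the start (as after a branch). `insn_t`, `sim_result_t` and `simulate` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int halfwords;  /* 1 for a 16-bit encoding, 2 for a 32-bit one */
    int is_mem;     /* 1 if it occupies the memory slot (LDW), so no fetch */
} insn_t;

typedef struct {
    unsigned cycles;
    unsigned fnops;
} sim_result_t;

/* Toy model only; see the assumptions in the lead-in. */
static sim_result_t simulate(const insn_t *body, size_t body_len,
                             unsigned passes) {
    unsigned total = 0;                 /* halfwords in the whole stream */
    for (size_t i = 0; i < body_len; i++)
        total += (unsigned)body[i].halfwords;
    total *= passes;

    unsigned fetched = total < 2 ? total : 2;
    unsigned buffered = fetched;
    sim_result_t r = {0, 0};

    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = 0; i < body_len; ) {
            int can_fetch;
            r.cycles++;
            if (buffered >= (unsigned)body[i].halfwords) {
                /* Issue the instruction; only non-memory ones may fetch. */
                buffered -= (unsigned)body[i].halfwords;
                can_fetch = !body[i].is_mem;
                i++;
            } else {
                /* Buffer underrun: spend the slot on a fetch (FNOP). */
                r.fnops++;
                can_fetch = 1;
            }
            if (can_fetch && buffered + 2 <= 4) {
                unsigned left = total - fetched;
                unsigned n = left < 2 ? left : 2;
                fetched += n;
                buffered += n;
            }
        }
    }
    return r;
}
```

In this model the interleaved sequence (LDW, sub, LDW, add, MAC) issues in 5 slots per tap with no FNOP, while LDW, LDW, MAC back to back takes 4 slots per tap, one of which is an FNOP.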

Show me how to do it in 4? Sorry for the wrong number earlier, I meant 5.25.
7N+M is a standard figure for DSPs computing a FIR, since you need to branch but also calculate the wrapped index into a single circular buffer, using either a remainder or a branch.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

The unrolled inner loop is

LDW ; LDW ; MACCS
LDW ; LDW ; MACCS
LDW ; LDW ; MACCS
LDW ; LDW ; MACCS
which has some FNOP bubbles, running every line in 4 cycles.

The outer loop needs some bookkeeping for adjusting the loop counter and the
pointer(s) and possibly the loop end condition; you can put those instructions
where the FNOPs would be. So something like

loop:
LDW a,0[p] ; LDW b,0[q] ; MACCS a,b
LDW a,1[p] ; LDW b,1[q] ; MACCS a,b
LDW a,2[p] ; LDW b,2[q] ; MACCS a,b
LDW a,3[p] ; LDW b,3[q] ; MACCS a,b
ADD p,p,sixteen
ADD q,q,sixteen
SUBI n,n,1
BRT n loop
becomes

SUBI q,q,4
SUBI p,p,4
loop:
LDW a,1[p] ; LDW b,1[q] ; MACCS a,b
LDW a,2[p] ; SUBI n,n,1 ; LDW b,2[q] ; MACCS a,b
LDW a,3[p] ; ADD p,p,sixteen ; LDW b,3[q] ; MACCS a,b
LDW a,0[p] ; ADD q,q,sixteen ; LDW b,0[q] ; MACCS a,b
BRT n loop
(and you can hide the BRT in the last FNOP bubble, but that is slightly more work
with staggering the loop, so exercise for the reader ;-) )
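
The index bookkeeping of the rescheduled listing can be checked in plain C. This is a sketch, not the XMOS code: `dot_ref` and `dot_staggered` are hypothetical names, the offsets are in words (so the 16-byte bump becomes +4), and signed offsets are used instead of biased pointers because forming a pointer before the start of an array is undefined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: plain dot product over n_words words. */
static int64_t dot_ref(const int32_t *p, const int32_t *q, size_t n_words) {
    int64_t acc = 0;
    for (size_t i = 0; i < n_words; i++)
        acc += (int64_t)p[i] * q[i];
    return acc;
}

/* Same sum, with the counter and offset updates placed between the loads. */
static int64_t dot_staggered(const int32_t *p, const int32_t *q,
                             size_t n_words) {
    assert(n_words != 0 && n_words % 4 == 0);  /* assumed precondition */
    int64_t acc = 0;
    ptrdiff_t po = -1, qo = -1;   /* biased by one word, like SUBI p,p,4 */
    size_t n = n_words / 4;       /* number of four-tap passes */
    do {
        int32_t a, b;
        a = p[po + 1];          b = q[qo + 1]; acc += (int64_t)a * b;
        a = p[po + 2]; n--;     b = q[qo + 2]; acc += (int64_t)a * b;
        a = p[po + 3]; po += 4; b = q[qo + 3]; acc += (int64_t)a * b;
        a = p[po + 0]; qo += 4; b = q[qo + 0]; acc += (int64_t)a * b;
    } while (n != 0);
    return acc;
}
```

The one-word bias is what lets the fourth tap use offset 0 after the bump, so every load offset stays in the range 0 to 3.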

EDIT Lilltroll: Ahaa, there is a LDWI instruction available using ints from 0 .. 11
mculibrk
Active Member
Posts: 38
Joined: Tue Jul 13, 2010 2:57 pm

Post by mculibrk »

Huh! Excellent hacking!

...now that you have such nicely polished code, add "thread scheduling 'glitches'" to it (when more than 4 threads are active on a single core)... ;)

(PS. don't take me too seriously, I'm just trying to point back to the original question/issue I was asking about.)

regards,
mculibrk