Thread priority - is there any option?

Post by **lilltroll** » Wed Apr 13, 2011 6:24 pm

FNOPs

It was done with the -fschedule option (when scheduling is enabled,
the assembler may reorder instructions to minimize the number of FNOPs),
plus writing the code in such a way that it compiles like handwritten ASM.
Instead of writing (using != 0 makes sure the compiler skips the comparison instruction)

Code:

#pragma unsafe arrays
for (int k = len; k != 0; k--)
	{hi, lo} = macs(h[k], x[k + state], hi, lo);

I wrote

Code:

do {
	int d, e;
#pragma loop unroll
	for (int s = 0; s < 4; s++) {
		d = h[i];
		i--;
		e = x[ds];
		ds = i + state; // or use ds--;
		{hi, lo} = macs(d, e, hi, lo);
	}
} while (i != 0); // Skip
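
For readers without the XC toolchain, the two loop shapes can be compared in plain C. This is only a sketch under assumptions: `macs64` is a hypothetical stand-in for the XC `macs` builtin (modelled here as a 64-bit signed multiply-accumulate rather than a hi/lo pair), the function names are made up, and `len` is assumed to be a non-zero multiple of 4, which the unrolled version needs in order to terminate.

```c
#include <stdint.h>

/* Hypothetical stand-in for XC's macs(): acc += a * b in 64 bits. */
static int64_t macs64(int32_t a, int32_t b, int64_t acc) {
    return acc + (int64_t)a * (int64_t)b;
}

/* Straightforward loop, counting down so the exit test is "k != 0". */
static int64_t fir_simple(const int32_t *h, const int32_t *x, int len, int state) {
    int64_t acc = 0;
    for (int k = len; k != 0; k--)
        acc = macs64(h[k], x[k + state], acc);
    return acc;
}

/* Reordered loop: load, index update, load, index update, MAC.
 * Assumes len is a non-zero multiple of 4. */
static int64_t fir_unrolled(const int32_t *h, const int32_t *x, int len, int state) {
    int64_t acc = 0;
    int i = len;
    int ds = i + state;       /* index into x for the first load */
    do {
        for (int s = 0; s < 4; s++) {
            int32_t d = h[i];
            i--;
            int32_t e = x[ds];
            ds = i + state;   /* prepared for the next tap */
            acc = macs64(d, e, acc);
        }
    } while (i != 0);
    return acc;
}
```

Both functions visit h[len] down to h[1], so `h` needs len+1 elements and `x` needs len+state+1. They should return the same sum; only the order in which loads and index updates appear differs.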

Please review my understanding here, it might be wrong:

ldw - 16bit, read SRAM
subi - 16bit, fetch ldw,add
ldw - 16bit, read SRAM
add - 16bit, fetch maccs
maccs - 32bit, fetch branch - make no speculative operations
branch if not zero - 16bit, fetch ldw,subi

The maccs gets aligned in the 64-bit instruction buffer, and the instruction buffer never runs empty in the loop after a branch. A new instruction fetch can only be done in a cycle where no read/write operation is performed.

The standard "for loop" compilation results in:
ldw
ldw
FNOP
maccs
...

since the instruction buffer becomes empty after 2 ldw in a row after the branch.
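
The reasoning above can be checked with a toy model in C. This is a sketch of the description in this post, not of the real pipeline: it assumes the taken branch leaves one 32-bit word in the buffer, that every non-memory instruction fetches one more word, that an FNOP cycle also fetches a word, and it ignores alignment. The type `Insn`, the function `count_fnops` and the two example sequences are hypothetical names; the `sub` and `bt` at the end of `plain_loop` are also an assumption about what follows the MAC.

```c
#include <stddef.h>

/* One instruction in the toy model. */
typedef struct {
    int bits;        /* encoded size: 16 or 32 */
    int uses_memory; /* 1 if it loads or stores, so no fetch shares the cycle */
} Insn;

/* Count FNOPs for one pass through a loop body entered by a taken branch. */
static int count_fnops(const Insn *body, size_t n) {
    int buffered = 32; /* bits assumed available after the branch */
    int fnops = 0;
    for (size_t i = 0; i < n; i++) {
        while (buffered < body[i].bits) { /* no complete instruction to issue */
            fnops++;
            buffered += 32;               /* the FNOP cycle fetches a word */
        }
        buffered -= body[i].bits;
        if (!body[i].uses_memory && buffered <= 32)
            buffered += 32;               /* fetch while the memory is idle */
    }
    return fnops;
}

/* ldw, ldw, maccs, sub, bt: two loads back to back drain the buffer. */
static const Insn plain_loop[] = {
    {16, 1}, {16, 1}, {32, 0}, {16, 0}, {16, 0}
};

/* ldw, sub, ldw, add, maccs, bt: a fetch slot after every load. */
static const Insn reordered_loop[] = {
    {16, 1}, {16, 0}, {16, 1}, {16, 0}, {32, 0}, {16, 0}
};
```

Under these assumptions `count_fnops(plain_loop, 5)` reports one FNOP (just before the maccs) and `count_fnops(reordered_loop, 6)` reports none. XTA or the simulator remains the authority on what the hardware actually issues.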

Pipeline:
1 decode reg-write
2 reg-read
3 address ALU1 resource-test
4 read/write/fetch ALU2 resource-access schedule

Idea for XMOS: -O3 or -fschedule should recognize a MAC in a for loop and reorder the instructions in a similar way.

If s=4, the computational burden becomes 5.25N+M for one thread (N = number of FIR filter taps), which is very good for a GP microcomputer.
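
The 5.25 figure can be reconstructed from the instruction list earlier in this post, assuming each listed instruction takes one issue slot and the branch is shared by the s unrolled taps. Reading M as the fixed cost outside the loop is an assumption, since it is not defined here.

```latex
\[
\underbrace{5}_{\text{ldw, subi, ldw, add, maccs}}
\;+\;
\underbrace{\tfrac{1}{s}}_{\text{one branch per } s \text{ taps}}
\;=\; 5 + \tfrac{1}{4} \;=\; 5.25 \quad (s = 4),
\qquad
\text{total} \approx 5.25\,N + M .
\]
```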

Post by **segher** » Wed Apr 13, 2011 9:19 pm

Ah, fetch noop, I didn't get that from context.

Is the "s" loop not fully unrolled? It loops a fixed four times and the loop body
is very short, so the compiler should do that.

Then, it gets to hoist the additions out of the loop, so you get two LDWs and
a MAC only, per iteration of the inner loop. That gives a bubble, but you should
be able to put the bookkeeping insns from the outer loop in there.

That would make 4 cycles per inner loop iteration. Doing better requires rather
more massive surgery, I suppose ;-)

Post by **lilltroll** » Thu Apr 14, 2011 3:13 am

Issued FNOPs can be viewed in XTA or in a simulation.

"Then, it gets to hoist the additions out of the loop, so you get two LDWs and
a MAC only"

I do not see how to move the array index update outside the loop.

LDW
LDW
MAC
LDW
LDW
MAC

cannot run without FNOPs, since only the MAC performs a fetch. You have to fetch after each LDW,
so it is best to use the fetch slots for something useful, such as the index update:

1 LDW no fetch
2 sub|add & fetch(3,4)
3 LDW no fetch
4 sub|add & fetch(5)
5 MAC&fetch(1,2)
1 LDW no fetch
2 sub|add & fetch(3,4)
3 LDW no fetch
4 sub|add & fetch(5)
5 MAC (includes fetch)

(The example includes double buffering of the data outside the loop)

Can you show me how to do it in 4? Sorry for the wrong number earlier, I meant 5.25.
7N+M is typical for DSPs computing a FIR, since you need to branch and also compute the wrapped index into a single circular buffer, using either a remainder operation or a branch.
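
The double buffering mentioned above can be sketched in C as follows. This is one possible reading, not the code used here: every sample is stored twice, `TAPS` apart, so the window starting at `state` is always contiguous and the inner loop needs neither a remainder nor a wrap test. `TAPS`, `DelayLine`, `delay_push`, `fir_double` and `fir_modulo` are hypothetical names, and the newest-first ordering is an assumption.

```c
#include <stdint.h>

#define TAPS 8 /* hypothetical filter length */

typedef struct {
    int32_t x[2 * TAPS]; /* every sample is stored twice, TAPS apart */
    int state;           /* index of the newest sample, 0 .. TAPS-1 */
} DelayLine;

/* Store a new sample in both halves so that x[state .. state+TAPS-1]
 * is always a contiguous window, newest sample first. */
static void delay_push(DelayLine *d, int32_t sample) {
    d->state = (d->state == 0) ? TAPS - 1 : d->state - 1;
    d->x[d->state] = sample;
    d->x[d->state + TAPS] = sample;
}

/* FIR over the contiguous window: no wrap handling inside the loop. */
static int64_t fir_double(const DelayLine *d, const int32_t *h) {
    int64_t acc = 0;
    for (int k = 0; k < TAPS; k++)
        acc += (int64_t)h[k] * d->x[d->state + k];
    return acc;
}

/* Reference: single circular buffer with a remainder on every access. */
static int64_t fir_modulo(const DelayLine *d, const int32_t *h) {
    int64_t acc = 0;
    for (int k = 0; k < TAPS; k++)
        acc += (int64_t)h[k] * d->x[(d->state + k) % TAPS];
    return acc;
}
```

The cost is twice the sample memory and one extra store per input sample, paid outside the MAC loop.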

Post by **segher** » Thu Apr 14, 2011 1:22 pm

The unrolled inner loop is

Code:

LDW ; LDW ; MACCS
LDW ; LDW ; MACCS
LDW ; LDW ; MACCS
LDW ; LDW ; MACCS

which has some FNOP bubbles, running every line in 4 cycles.

The outer loop needs some bookkeeping for adjusting the loop counter and the
pointer(s) and possibly the loop end condition; you can put those instructions
where the FNOPs would be. So something like

Code:

loop:
LDW a,0[p] ; LDW b,0[q] ; MACCS a,b
LDW a,1[p] ; LDW b,1[q] ; MACCS a,b
LDW a,2[p] ; LDW b,2[q] ; MACCS a,b
LDW a,3[p] ; LDW b,3[q] ; MACCS a,b
ADD p,p,sixteen
ADD q,q,sixteen
SUBI n,n,1
BRT n loop

becomes

Code:

SUBI q,q,4
SUBI p,p,4
loop:
LDW a,1[p] ; LDW b,1[q] ; MACCS a,b
LDW a,2[p] ; SUBI n,n,1 ; LDW b,2[q] ; MACCS a,b
LDW a,3[p] ; ADD p,p,sixteen ; LDW b,3[q] ; MACCS a,b
LDW a,0[p] ; ADD q,q,sixteen ; LDW b,0[q] ; MACCS a,b
BRT n loop

(and you can hide the BRT in the last FNOP bubble, but that is slightly more work
with staggering the loop, so exercise for the reader ;-) )
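
To check that the rotated offsets still touch every element exactly once, here is a C sketch of the same index arithmetic. It is only an illustration: `dot4_rotated` is a hypothetical name, array indices stand in for the pointers p and q (so "sixteen" bytes becomes 4 words), and the scheduling benefit itself cannot be shown in C.

```c
#include <stdint.h>

/* Sum of h[i] * x[i] for i in 0 .. 4*n-1, with n >= 1.
 * The base indices start one word low, mirroring the two SUBIs before
 * the loop, so the bumps can sit between the loads. */
static int64_t dot4_rotated(const int32_t *h, const int32_t *x, int n) {
    int64_t acc = 0;
    int p = -1, q = -1; /* biased base indices, one word low */
    int32_t a, b;
    do {
        a = h[p + 1];         b = x[q + 1]; acc += (int64_t)a * b;
        a = h[p + 2]; n--;    b = x[q + 2]; acc += (int64_t)a * b;
        a = h[p + 3]; p += 4; b = x[q + 3]; acc += (int64_t)a * b;
        a = h[p + 0]; q += 4; b = x[q + 0]; acc += (int64_t)a * b;
    } while (n != 0);
    return acc;
}
```

In the fourth line the base has already been bumped, so offset 0 addresses what offset 3 addressed in the plain version of the loop.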

EDIT Lilltroll: Ahaa, there is a LDWI instruction available using ints from 0 .. 11

Post by **mculibrk** » Thu Apr 14, 2011 4:05 pm

Huh! Excellent hacking!

...now that you have such nicely polished code, add the "thread scheduling 'glitches'" to it (when more than 4 threads are active on a single core)... ;)

(PS. Don't take me too seriously, I'm just trying to point back to the original question/issue I was asking about.)

regards,
mculibrk
