It was done with the -fschedule option: when scheduling is enabled,
the assembler may reorder instructions to minimize the number of FNOPs.
On top of that, I wrote the code in such a way that it compiles like handwritten ASM.
Instead of writing (using != 0 makes sure the compiler skips the comparison instruction):
#pragma unsafe arrays
for (int k = len; k != 0; k--)
    {hi, lo} = macs(h[k], x[k + state], hi, lo);
I wrote:
do {
    int d, e;
#pragma loop unroll
    for (int s = 0; s < 4; s++) {
        d = h[i];
        i--;
        e = x[ds];
        ds = i + state; // or use ds--;
        {hi, lo} = macs(d, e, hi, lo);
    }
} while (i != 0); // Skip
This compiles to:
ldw - 16bit, read SRAM
subi - 16bit, fetch ldw,add
ldw - 16bit, read SRAM
add - 16bit, fetch maccs
maccs - 32bit, fetch branch - makes no speculative operations
branch if not zero - 16bit, fetch ldw,subi
The maccs gets aligned in the 64-bit instruction buffer, and the instruction buffer never runs empty in the loop after a branch. A new instruction fetch can only be done in a cycle where no read/write operation is performed.
The standard "for loop" compilation results in:
ldw
ldw
FNOP
maccs
...
since the instruction buffer becomes empty after 2 ldw in a row after the branch.
Pipeline:
1: decode, reg-write
2: reg-read
3: address, ALU1, resource-test
4: read/write/fetch, ALU2, resource-access, schedule
XMOS, an idea: -O3 or -fschedule should recognize a MAC in a for-loop and reorder the instructions in a similar way.
With s=4, the computational burden becomes 5.25N+M for one thread (N being the number of FIR filter taps), which is very good for a general-purpose microcomputer.