How to build a FIR with 3000 points

lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

The best optimization depends on many things: is there enough memory to use double data, can we accept some extra latency through the FIR filter, etc.
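A minimal sketch of one reading of "double data", assuming it means storing every sample twice so that the inner loop never has to wrap around a circular buffer. The names fir_state, fir_init and fir_step are made up for this illustration and are not from the code that will be pushed to github.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TAPS 3000

/* Hypothetical state: the delay line is kept twice, costing 2x the memory. */
typedef struct {
    int32_t delay[2 * MAX_TAPS]; /* delay[i + taps] mirrors delay[i] */
    size_t pos;                  /* index of the newest sample */
    size_t taps;
} fir_state;

void fir_init(fir_state *s, size_t taps)
{
    assert(taps > 0 && taps <= MAX_TAPS);
    for (size_t i = 0; i < 2 * MAX_TAPS; i++)
        s->delay[i] = 0;
    s->pos = 0;
    s->taps = taps;
}

/* Filter one sample. Assumes the accumulated sum fits in 64 bits. */
int64_t fir_step(fir_state *s, const int32_t *coef, int32_t x)
{
    /* Step backwards so the newest sample always sits at window[0]. */
    s->pos = (s->pos == 0) ? s->taps - 1 : s->pos - 1;
    s->delay[s->pos] = x;
    s->delay[s->pos + s->taps] = x; /* the second copy */

    /* window[0..taps-1] is contiguous, so no modulo in the loop. */
    const int32_t *window = &s->delay[s->pos];
    int64_t acc = 0;
    for (size_t k = 0; k < s->taps; k++)
        acc += (int64_t)coef[k] * window[k];
    return acc;
}
```

The trade is twice the data memory for a straight-line multiply-accumulate loop, which is why the available memory matters.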

I will push one implementation to GitHub on Monday.
For 3000 taps, a single FIR thread in practice runs (5n+m):
Testing performance, Running FIR-filter for 1 sec on a single thread with 3000 filter taps
Filtered 6660 samples during 1 second
19980 kTaps per sec.
CRC32 checksum for all filtered samples was: 0xED9B0990
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0xED9B0990
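The kTaps figure in the console output is just samples per second times taps per sample, divided by 1000. A small sketch of that arithmetic (the function name ktaps_per_sec is only for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* kTaps/s = samples filtered in one second * taps per sample / 1000 */
uint64_t ktaps_per_sec(uint64_t samples_per_sec, uint64_t taps)
{
    assert(taps > 0);
    return samples_per_sec * taps / 1000u;
}
```

With 6660 samples and 3000 taps this gives the 19980 kTaps per sec. printed above.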


4 FIR threads on one core, plus a distributing thread to avoid any extra filter latency (5 threads on one core in total):
This tree-structured implementation should perform worse than a ring implementation, so these should be worst-case numbers.
Testing performance, Running FIR-filter for 1 sec on quad threads with 3000 filter taps
Filtered 25538 samples during 1 second
76614 kTaps per sec.
CRC32 checksum for all filtered samples was: 0x1E2D8273
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0x1E2D8273
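A rough sketch of how a tree split can avoid extra latency, assuming each FIR thread computes a partial sum over its own slice of the taps and the distributing thread adds the partial sums. The workers are simulated serially in plain C here, and fir_partial and fir_split are hypothetical names, not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What one worker thread would compute: the sum over its slice of taps. */
int64_t fir_partial(const int32_t *coef, const int32_t *window,
                    size_t start, size_t count)
{
    int64_t acc = 0;
    for (size_t k = start; k < start + count; k++)
        acc += (int64_t)coef[k] * window[k];
    return acc;
}

/* What the distributing thread would do: hand out slices, add the results. */
int64_t fir_split(const int32_t *coef, const int32_t *window,
                  size_t taps, size_t workers)
{
    assert(workers > 0 && workers <= taps);
    size_t chunk = taps / workers;
    int64_t total = 0;
    for (size_t w = 0; w < workers; w++) {
        size_t start = w * chunk;
        /* The last worker also takes any remainder. */
        size_t count = (w == workers - 1) ? taps - start : chunk;
        total += fir_partial(coef, window, start, count);
    }
    return total;
}
```

Because all slices belong to the same output sample, the result is available in the same sample period, at the cost of the distribution and summing overhead.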


Using 4 cores you have several choices. One way is to calculate
sample 1 on stdcore[0]
sample 2 on stdcore[1]
sample 3 on stdcore[2]
sample 4 on stdcore[3]
sample 5 on stdcore[0] :> outputting the filter result from sample 1
sample 6 on stdcore[1] :> outputting the filter result from sample 2

i.e. a latency of 4 samples in the filter.
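The round-robin schedule listed above can be sketched as two small helpers. Samples are numbered from 1 as in the list, and the function names are made up for this illustration.

```c
#include <assert.h>

/* Which core filters a given sample (samples numbered from 1). */
unsigned core_for_sample(unsigned sample, unsigned cores)
{
    assert(sample >= 1 && cores > 0);
    return (sample - 1) % cores;
}

/* The sample number at which the result for `sample` is output:
 * the core is next visited `cores` samples later. */
unsigned output_slot_for_sample(unsigned sample, unsigned cores)
{
    assert(sample >= 1 && cores > 0);
    return sample + cores;
}
```

With 4 cores, sample 1 lands on core 0 and its result comes out when sample 5 arrives, which is the 4-sample latency.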

Another solution is to run perhaps 15 or 19 FIR filter threads plus one distribution thread that balances the load over the cores. This might work well if you can run one sample ahead.


Probably not the most confused programmer anymore on the XCORE forum.

Post by lilltroll »

A simple optimization: the channel FIFO is 64 bits long, which makes it simple to make use of all ALU cycles.


Testing performance, Running FIR-filter for 1 sec on quad threads with 3000 filter taps
Filtered 26511 samples during 1 second
79533 kTaps per sec.
CRC32 checksum for all filtered samples was: 0xEB9762A1
Calculating the CRC32 checksum from the XC implementation, this might take some time
Correct Checksum for filtered datasequence is: 0xEB9762A1


Compared to 19980*4=79920 kTaps per sec., this solution runs at 99.52% of the speed of 4 independent single FIR threads, which is not a big penalty. I think I will stop optimizing now ... or maybe not.
Time to attack the asm kernel on each thread.
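The efficiency figure is the measured quad-thread throughput divided by four times the single-thread throughput. A small sketch of that calculation (scaling_efficiency is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Measured throughput relative to `threads` independent single-thread FIRs. */
double scaling_efficiency(uint64_t measured_ktaps, uint64_t single_ktaps,
                          unsigned threads)
{
    assert(single_ktaps > 0 && threads > 0);
    return (double)measured_ktaps / ((double)single_ktaps * threads);
}
```

Calling scaling_efficiency(79533, 19980, 4) gives roughly 0.9952.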

Post by lilltroll »

The latest result on a G4:

Console
Testing performance, Running FIR-filter for 1 sec on 3 cores with 4 threads/core with 3000 filter taps
77221 samples during 1 second
231663 kTaps per sec.


i.e. using 12 threads

Using 15 threads would give 96526 samples/second minus the distribution penalty, so it might still work above 96k samples/s.
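The 15-thread estimate is a linear extrapolation from the 12-thread measurement, before subtracting any distribution penalty. As a sketch, assuming throughput scales with the thread count (extrapolate_samples is a made-up helper):

```c
#include <assert.h>
#include <stdint.h>

/* Scale a measured sample rate linearly with the number of FIR threads. */
uint64_t extrapolate_samples(uint64_t measured, unsigned from_threads,
                             unsigned to_threads)
{
    assert(from_threads > 0);
    return measured * to_threads / from_threads;
}
```

extrapolate_samples(77221, 12, 15) returns 96526, leaving a margin of about 500 samples/s over 96k for the distribution penalty.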