Howto build a FIR with 3000 points

dirk1980
Active Member
Posts: 32
Joined: Fri Oct 07, 2011 3:20 pm


Post by dirk1980 »

Hi,
I found XMOS and I'm wondering whether I can use it for my new project.

I have an ADC (24-bit, 96 kHz, mono) and I need a FIR with 3000 points (10 us time).
Output is S/PDIF or I²S.

Is this possible with an XMOS system?

Dirk


lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

For a 400 MHz device, each thread can process 20 Mtaps per second. If you use even or quad length you may come close to 25 Mtaps per second.

You need 3k*stereo*100k = 600 Mtaps per second, so you would need many threads.

If you can accept some latency, it is common to use an FFT instead. But then you need some memory, for say 8192-point FFTs.

The Xilinx VI FPGA with DSP extension can do 200 Mtaps per second @ 400 MHz, and if you choose one with many multipliers, it can handle several FIR filters in parallel. That is the route if you cannot accept latency.

They also have the very new Zynq-7000; the largest edition handles 912 GMACS for a symmetrical FIR filter. And yes, it is Giga, not Mega.

Anyway, if you can accept some latency the BOM will be much cheaper than for a design with 10-20 us latency through the filter.

A dual-core 1.5 GHz DSP optimized for FIR will also do the job, and that is about the state of the art out there at the moment. Whether a dual-core ARM Cortex A8 @ 1.5 GHz can do it, I do not know; check the latest from Texas Instruments. The pipeline and memory fetch might be a problem.
Probably not the most confused programmer anymore on the XCORE forum.
dirk1980
Active Member
Posts: 32
Joined: Fri Oct 07, 2011 3:20 pm

Post by dirk1980 »

It's mono, so I need just 300 Mtaps.

4 cores * 4 threads -> 4 * 80 Mtaps = 320 Mtaps
Plus some input and output stuff.
It's not the best way, but I will try it.

It's the cheapest way I found!

Is an 8192-point FFT & signal shaping faster on XMOS?


EDIT:
Can I run 1 thread at 400 MHz? Or must I split the FIR across 4 threads?

Dirk
Bianco
XCore Expert
Posts: 754
Joined: Thu Dec 10, 2009 6:56 pm

Post by Bianco »

dirk1980 wrote:
EDIT:
Can I run 1 thread at 400 MHz? Or must I split the FIR across 4 threads?
You will need at least 4 threads to get all the MIPS.
dirk1980
Active Member
Posts: 32
Joined: Fri Oct 07, 2011 3:20 pm

Post by dirk1980 »

But someone has written that he has done something like this.
Maybe not exactly what I need.

I can't find it, sorry.
Last edited by lilltroll on Sat Oct 08, 2011 2:32 pm, edited 1 time in total.
Reason: opps
bearcat
Respected Member
Posts: 283
Joined: Fri Mar 19, 2010 4:49 am

Post by bearcat »

FIR filters are pretty trivial to implement, if you are talking about single-precision 32-bit coefficients. Yes, you will need to split the filter across 4 threads to achieve the most taps, and again a FIR is trivial to split up. Using channels, this is very easy to implement across 4 threads and multiple cores.

Each tap requires 3 - 3.1 instructions. On a 500 MHz device, this gives 125 M instructions/s per thread = 40 Mtaps/s per thread, x 4 threads = 160 Mtaps/s per core MAX. Using two cores, this is probably achievable, including I2S output using slave mode for the XMOS. Either a G4, an L2, or two L1s. There are some hardware issues to design for.
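
The arithmetic above can be written out as a quick back-of-the-envelope check. This is only a sketch: the function names are made up for illustration, it assumes the core clock is shared evenly between 4 threads, and the 3.1 instructions per tap is the figure quoted in this post, not a measured value.

```python
import math

def taps_per_second(clock_hz, threads, instr_per_tap):
    """Taps/s for one thread, assuming the core clock is shared evenly by `threads`."""
    instr_per_thread = clock_hz / threads
    return instr_per_thread / instr_per_tap

def threads_needed(fir_taps, sample_rate_hz, per_thread_rate):
    """Smallest whole number of threads that covers fir_taps * sample_rate_hz."""
    required = fir_taps * sample_rate_hz
    return math.ceil(required / per_thread_rate)

per_thread = taps_per_second(500e6, 4, 3.1)       # assumed 500 MHz core, 4 threads
print(round(per_thread / 1e6, 1))                 # 40.3 (Mtaps/s per thread)
print(threads_needed(3000, 96_000, per_thread))   # 8 threads, i.e. two cores
```

With a different instructions-per-tap figure the thread count changes accordingly, and the budget leaves nothing over for I/O threads, so treat the result as a lower bound.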

Probably only a couple hundred lines of code total here including I2S output.
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

bearcat wrote:
Each tap requires 3 - 3.1 instructions.
Probably not.

A load-word instruction is a memory access and thus doesn't fetch a new instruction into the instruction buffer.

The way of doing it, with the memory pointer to the array pre-offset, looks like this:
https://gist.github.com/1216283

or


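subloop:
    maccs ynh,ynl,h,x   // ynh:ynl += h*x (signed 64-bit accumulate)
entrypoint:
    ldw x,Xoff[i]       // Xoffset = X-1*int32
    sub i,i,1           // i = i-1
    ldw h,H[i]          // load coefficient h = H[i]
    bt i,subloop        // loop while i != 0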
And this is 5 instructions in the inner loop, using the double-data method to create a linear buffer from the circular one.
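
For readers who have not met the double-data trick, here is a minimal sketch in Python of how I read it: every sample is written twice, N positions apart, so the newest N samples always sit in one contiguous run and the inner loop needs no wrap-around test. The class name and layout are illustrative assumptions, not the code from the gist.

```python
class DoubleBufferFIR:
    """FIR whose delay line stores each sample twice so the newest N samples
    are always contiguous (no wrap-around test inside the inner loop)."""

    def __init__(self, coeffs):
        self.h = list(coeffs)
        self.n = len(self.h)
        self.buf = [0] * (2 * self.n)  # two copies of the circular buffer
        self.pos = 0                   # next write index, 0 <= pos < n

    def step(self, sample):
        # Write the sample at pos and again at pos + n.
        self.buf[self.pos] = sample
        self.buf[self.pos + self.n] = sample
        # buf[pos+1 .. pos+n] is now a linear window: oldest first, newest last.
        window = self.buf[self.pos + 1 : self.pos + self.n + 1]
        self.pos = (self.pos + 1) % self.n
        # y[n] = sum_k h[k] * x[n-k]; the newest sample pairs with h[0].
        return sum(hk * xk for hk, xk in zip(self.h, reversed(window)))

fir = DoubleBufferFIR([1, 2, 3])
print([fir.step(s) for s in [1, 0, 0, 0]])   # impulse response: [1, 2, 3, 0]
```

The cost is twice the delay-line memory and one extra store per sample, in exchange for a branch-free address calculation per tap.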
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

dirk1980 wrote:It's mono, so I need just 300 Mtaps.

4 cores * 4 threads -> 4 * 80 Mtaps = 320 Mtaps
Plus some input and output stuff.
It's not the best way, but I will try it.

It's the cheapest way I found!

Is an 8192-point FFT & signal shaping faster on XMOS?


EDIT:
Can I run 1 thread at 400 MHz? Or must I split the FIR across 4 threads?

Dirk
I thought a little more about it. I think you can do it with 4096-point FFTs and IFFTs.
It works like this.
We collect a block of 4096-3000=1096 points of audio data, zero-pad it to 4096 points and run an FFT on it.
The input signal x is now Fourier-transformed to the signal X in the frequency domain.
Your impulse response h is pre-transformed (zero-padded to the same 4096 points), giving H.

Instead of convolution we can perform Y=X*H in the frequency domain (here * is a point-by-point product), i.e. 4096 (complex) multiplications.
Finally we do the inverse FFT to go back to the time-domain y. You will now have 1096 points of filtered data + about 3000 points of tail in the vector. In the next step this tail is overlapped with and added to the following block's output, and the final result is the same as the FIR - http://en.wikipedia.org/wiki/Overlap_add
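
The steps above can be sketched in plain Python. This is a floating-point illustration only, not XMOS code: the function names are made up, the recursive radix-2 FFT is the textbook one rather than an optimized library, and the block length used is nfft - len(h) + 1, one sample more than the 1096 in the description (both avoid circular wrap-around).

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. Inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def overlap_add(x, h, nfft):
    """Filter x with FIR h using blocks of nfft - len(h) + 1 new samples."""
    m = len(h)
    block = nfft - m + 1                         # largest block with no circular wrap
    H = fft(list(h) + [0] * (nfft - m))          # pre-transformed impulse response
    y = [0.0] * (len(x) + m - 1)
    for start in range(0, len(x), block):
        seg = list(x[start:start + block])
        X = fft(seg + [0] * (nfft - len(seg)))   # zero-pad and transform
        Y = [a * b for a, b in zip(X, H)]        # point-by-point complex multiply
        yb = fft(Y, inverse=True)
        for i in range(min(nfft, len(y) - start)):
            y[start + i] += yb[i].real / nfft    # scale, then add the overlapping tail
    return y

x = [((7 * i) % 11) - 5 for i in range(50)]      # arbitrary test signal
h = [1.0, -2.0, 3.0, 0.5, 4.0]                   # arbitrary 5-tap filter
y = overlap_add(x, h, 16)
```

For the case discussed here the call would be `overlap_add(x, h, 4096)` with a 3000-point `h`; a fixed-point version on XMOS would also have to manage scaling and rounding at every FFT stage.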

This would thus be done around 96 times per second (96000/1096 is about 88, to be precise), and give you a delay of roughly 1/96 of a second.
Instead of 300 MTaps/s, which is > 1500 Minstructions/s, you "only" need to do this FFT/IFFT stuff 96 times per second.
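
To put a rough number on that comparison, the sketch below counts radix-2 butterflies and point-by-point multiplies per second. It is an estimate under stated assumptions: a plain radix-2 FFT with (N/2)*log2(N) butterflies, a 1096-sample block, and no attempt to convert complex operations into XMOS instructions. The function name is made up for illustration.

```python
import math

def overlap_add_ops_per_second(nfft, fir_len, sample_rate_hz):
    """Rough count of complex butterflies/multiplies per second (not instructions)."""
    block = nfft - fir_len                              # new samples per block
    blocks_per_second = sample_rate_hz / block
    butterflies = (nfft // 2) * int(math.log2(nfft))    # one radix-2 FFT
    per_block = 2 * butterflies + nfft                  # FFT + IFFT + pointwise multiply
    return blocks_per_second, per_block * blocks_per_second

blocks, ops = overlap_add_ops_per_second(4096, 3000, 96_000)
print(round(blocks, 1))        # 87.6 blocks per second
print(round(ops / 1e6, 2))     # 4.66 million complex operations per second
```

Each butterfly is several real multiplies and adds, so the instruction count is a multiple of this figure, but it is still far below the direct-form tap rate.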

You can find an existing FFT lib on GitHub, but it is not optimized for high sound quality. Do you expect 120 dB SNR out of the filter? It might be possible with FFT on XMOS, but maybe we have to store 64-bit words in memory for the FFT; I cannot say offhand. (The FFT creates a little noise due to rounding errors.)

The computational burden of 96 FFTs + 96 IFFTs is much lower than the Minstructions/s of the direct FIR, but it uses more memory. A 4096-long int32 vector is already 16 kbyte of memory. You need 2 vectors with complex data, that is X[4096] (int32_complex) 32 kbyte and H[4096] (int32_complex) 32 kbyte. And you are already out of memory on one core.
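
The memory sums can be checked in a few lines. The 64 KiB per-core RAM figure below is my assumption for illustration (check the datasheet of the actual part), and the helper name is made up.

```python
def complex_vector_bytes(length, bytes_per_part=4):
    """Bytes for `length` complex values stored as separate real and imaginary ints."""
    return length * 2 * bytes_per_part

CORE_RAM_BYTES = 64 * 1024   # assumption: 64 KiB of RAM per core

x_bytes = complex_vector_bytes(4096)   # X[4096], int32 complex
h_bytes = complex_vector_bytes(4096)   # H[4096], int32 complex
print(x_bytes // 1024, h_bytes // 1024, (x_bytes + h_bytes) // 1024)  # 32 32 64
print(CORE_RAM_BYTES - (x_bytes + h_bytes))    # 0 bytes left for code, stack and I/O
print(complex_vector_bytes(4096, 8) // 1024)   # 64: one vector alone with 64-bit words
```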

You need to distribute the FFT, or maybe get by with FFT algorithms specialized for real and/or symmetric data.

See http://en.wikipedia.org/wiki/Fast_Fouri ... etric_data
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

An example in MATLAB on one core of an Intel Core i5 @ 2.67 GHz.

I have 3000 points of data in h, and x is 100 s long @ 96 kHz.

Standard FIR
tic;filter(h,1,x);toc
Elapsed time is 16.062203 seconds.


With 4096 long FFT
tic;fftfilt(h,x,4096);toc
Elapsed time is 2.649616 seconds.

With 8192 long FFT
tic;fftfilt(h,x,8192);toc
Elapsed time is 1.197416 seconds.
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

dirk1980 wrote:But someone has written that he has done something like this.
Maybe not exactly what I need.

I can't find it, sorry.
Well, on a G4: run the I2S on one thread, which also distributes the data to, and collects the results from, the 15 other threads.
Use my ASM example; it uses streaming channels.

It might be possible at full load. Would ending up with 2900 taps be a disaster?