Fast bus sniffer

New User
Posts: 2
Joined: Tue Feb 21, 2017 5:10 pm

Post by »


I am quite new to the XMOS platform and I will very likely sound naive.

I am currently investigating the feasibility of using an XLF216 to build a protocol analyzer.
I need to oversample both the clock and data lines of a bus running double data rate @ 12.5 MHz. I need to capture 2 such buses at a time.
As I read it, my best option is to use a 4-bit I/O port and deserialise it to a 32-bit integer.
With an oversampling rate of 50 MHz, the 32-bit shift register is ready for processing at a rate of 6.25 MHz.
I need to add a time stamp to the 32 bit value (I will probably make it 32 bits wide too).
The register and its time stamp will be saved in a 4 kBytes buffer.
Is that manageable? Will the CPU be fast enough? Note that if I can sample @ 100 MHz, it would be even better.
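
As a rough sanity check of the figures above, here is a minimal Python sketch of the arithmetic. The function and constant names are made up for illustration; the only inputs are the numbers from the post (4-bit port, 32-bit shift register, 32-bit timestamp, 4 kByte buffer).

```python
# Back-of-envelope figures for the 4-bit port plan (names are illustrative only).
PORT_WIDTH_BITS = 4        # clock + data lines of two buses
SHIFT_REG_BITS = 32        # deserialised word size
TIMESTAMP_BITS = 32        # proposed timestamp width
BUFFER_BYTES = 4 * 1024    # proposed capture buffer


def capture_budget(sample_rate_hz):
    """Return word rate, byte rate and buffer fill time for a given sample rate."""
    samples_per_word = SHIFT_REG_BITS // PORT_WIDTH_BITS      # 8 samples per word
    word_rate_hz = sample_rate_hz / samples_per_word
    record_bytes = (SHIFT_REG_BITS + TIMESTAMP_BITS) // 8     # 8 bytes per record
    records_per_buffer = BUFFER_BYTES // record_bytes
    return {
        "word_rate_hz": word_rate_hz,
        "bytes_per_second": word_rate_hz * record_bytes,
        "records_per_buffer": records_per_buffer,
        "buffer_fill_time_s": records_per_buffer / word_rate_hz,
    }


if __name__ == "__main__":
    for rate in (50e6, 100e6):
        print(rate, capture_budget(rate))
```

At 50 MHz this gives a 6.25 MHz word rate, 512 records per buffer, and a buffer that fills in roughly 82 microseconds, so the downstream stages have to sustain about 50 MBytes/s before any idle filtering.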

Once the 4k buffer is full, another core will scan it to see if it contains data (not all 0). If data is present, the 64 bit block will be moved to SDRAM.
Another core will get data blocks from SDRAM and move them to USB through a parallel FIFO (I do not plan to use the xCORE with integrated USB, for driver availability reasons).
The filter to get rid of the idle bus captures might be a performance bottleneck, but it is absolutely necessary in order not to overload the PC app that will decode the bit streams.
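
The idle filter described above can be sketched in a few lines of Python. This is only a model of the idea, assuming each record is a (timestamp, word) pair and that an idle capture reads as all zeros, as stated in the post; `filter_idle` and `idle_word` are hypothetical names.

```python
def filter_idle(records, idle_word=0):
    """Keep only (timestamp, word) records whose data word differs from the idle pattern."""
    return [(ts, word) for ts, word in records if word != idle_word]


if __name__ == "__main__":
    buffer = [(100, 0x00000000), (101, 0x0000F0A5), (102, 0x00000000), (103, 0x12340000)]
    print(filter_idle(buffer))
```

Because each surviving record carries its own timestamp, the host can still reconstruct the gaps left by the dropped idle records.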

Do you see any obvious showstoppers that would disqualify the use of an XMOS CPU for this kind of application?
The other obvious candidate for the project is an FPGA but the design would become significantly more complex and expensive.

Thanks for your advice,


Active Member
Posts: 43
Joined: Wed Apr 06, 2011 8:02 pm

Post by data »

(Disclaimer: I am not an XMOS employee -- experts may have better answers here. I just found this problem interesting.)

I think your idea has potential. You will certainly need a shared-memory buffer, which you seem to be anticipating.

Also, one buffer probably will not be enough; you will need at least two buffers, so that you can analyse one as the other is being filled.

The main challenge will be that you will have to write very tight code in the receive thread. You will have up to 20 thread cycles to collect and store each 32-bit sample, containing 8 bits of each line (assuming a 125MHz thread). The receive thread will have to run continuously and it will not have time for communicating with other threads -- you'll have to start it and let it run. I think it should be possible to do that in 20 cycles, but assembly language may be required.

I recommend instead using 1-bit ports for each line. You can synchronise them with each other, and you will have 80 cycles to collect and store the four samples (125MHz threads, sampling at 50MHz). This isn't any more cycles per sample but it will be easier to write the code -- the hard real-time deadlines occur only one-fourth as often. Also, the resulting data will be easier to analyse, as you won't have to unshuffle it.

Also, using 1-bit ports would allow you to use two threads instead of one, with a thread dedicated to each pair of lines. That would double the number of thread cycles per sample, to 40 (still with 80 cycles between port call pairs). You would probably not need assembly for this. The two threads would be identical, so you could write one thread and instantiate it twice. You could still analyse and store the data in a single thread.
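
The cycle budgets quoted in the last three paragraphs can be reproduced with a short Python sketch. It assumes, as the post does, a 125 MHz thread and a 50 MHz sample rate, with every port deserialised to 32 bits; the function names are illustrative.

```python
THREAD_HZ = 125e6   # per-thread instruction rate assumed in the post
SAMPLE_HZ = 50e6    # oversampling rate


def cycles_between_inputs(port_width_bits, word_bits=32):
    """Thread cycles between successive 32-bit words arriving from one port."""
    samples_per_word = word_bits // port_width_bits
    return THREAD_HZ / SAMPLE_HZ * samples_per_word


def cycles_per_word(port_width_bits, ports_per_thread):
    """Thread cycles available to collect and store each 32-bit word."""
    return cycles_between_inputs(port_width_bits) / ports_per_thread


if __name__ == "__main__":
    print("one 4-bit port, one thread:   ", cycles_per_word(4, 1))   # 20.0
    print("four 1-bit ports, one thread: ", cycles_per_word(1, 4))   # 20.0
    print("two 1-bit ports per thread:   ", cycles_per_word(1, 2))   # 40.0
    print("deadline spacing, 1-bit ports:", cycles_between_inputs(1))  # 80.0
```

The budget per word is the same for one 4-bit port and four 1-bit ports on a single thread; what changes is how often the deadline comes round (every 80 cycles instead of every 20), and splitting the ports across two threads doubles the budget per word.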

Since the buffer is continuous and the data will be collected synchronously (no pauses or gaps), I don't see a need to timestamp every sample (unless you have some specific requirement of course). The analysis thread can timestamp the buffers, which would give the same information.

Finally, in this application it may be worth considering a bit of external logic, to eliminate the need for oversampling. This is how XMOS deal with the double-clocked RGMII interface.
XCore Addict
Posts: 230
Joined: Wed Mar 10, 2010 12:46 pm

Post by peter »

data made a number of good points about using 1-bit ports instead of a 4-bit port; that approach will be easier to deal with if you have enough 1-bit ports available. If you have to use a 4-bit port, then it is worth looking at the UNZIP instruction, which efficiently extracts bits, nibbles, etc. into separate registers.
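
To illustrate what "unshuffling" a 4-bit port word involves, here is a plain Python model of the bit rearrangement. It is not the UNZIP instruction itself, only the transformation that has to happen one way or another. It assumes the earliest sample sits in the least significant nibble and that bit 0 of each nibble is line 0; check both against the port documentation before relying on them.

```python
def split_lines(word, lines=4, word_bits=32):
    """Split an interleaved port word into one integer per line.

    Bit i of each returned value is that line's i-th sample
    (assumption: earliest sample in the least significant nibble).
    """
    samples = word_bits // lines
    mask = (1 << lines) - 1
    out = [0] * lines
    for i in range(samples):
        nibble = (word >> (i * lines)) & mask      # the i-th 4-bit sample
        for line in range(lines):
            out[line] |= ((nibble >> line) & 1) << i
    return out


if __name__ == "__main__":
    # Line 0 high in every sample, all other lines low.
    print([hex(v) for v in split_lines(0x11111111)])
```

Doing this bit by bit in software costs far more than the 20 cycles available per word, which is why a dedicated instruction or 1-bit ports matter here.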
Respected Member
Posts: 346
Joined: Wed Jan 27, 2016 5:21 pm

Post by henk »

Hi Xavier,

Data has addressed all the critical issues.

I did try a similar project in the past, using the sigrok libraries to interface it on the host side. As hardware on my side I reappropriated an XTAG3, creating a 6-channel "logic analyser". The sample rate was severely limited on what I made (10 MHz?). This project was pre xCORE200 and was limited by a lack of memory and MIPS. xCORE200 has the extra memory and MIPS to possibly make something much more interesting.

There are three things to consider: the sample rate, the connection back to the host, and whether you are going to have some sort of trigger mechanism in your device.

Sampling a 100 MHz signal on a 4- or 8-bit port is not an issue at all electrically, provided the signals are good (you may need expensive scope probes, or at least some buffering close to the source of the signal). Four bits at 100 MHz results in 400 Mbits per second, which is not too arduous for a single thread (the USB library catches 480 Mbits/s); 8 bits at 100 MHz results in 800 Mbits per second, which can be done but needs careful thinking (the RGMII RX threads catch 1000 Mbits/s). The code to do this requires something like IN; AND; STD; IN; ADD; BT, which dual-issues into four slots and, at 16 ns per instruction (500 MHz, all threads running), gets you 64 bits every 64 ns, or 1000 Mbits/s.
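
The throughput figures above follow from simple arithmetic, sketched here in Python. The assumption that "all threads running" means eight threads sharing a 500 MHz tile is mine; it is what makes the 16 ns per issue slot come out. All names are illustrative.

```python
CORE_MHZ = 500   # tile clock from the post
THREADS = 8      # assumption: eight hardware threads sharing the tile


def port_throughput_mbps(port_bits, sample_mhz):
    """Raw capture rate of a port in Mbit/s."""
    return port_bits * sample_mhz


def loop_throughput_mbps(bits_per_iteration, issue_slots):
    """Rate sustained by a loop that moves bits_per_iteration bits in issue_slots slots."""
    slot_ns = 1000 * THREADS / CORE_MHZ            # 16 ns per issue slot
    return bits_per_iteration / (issue_slots * slot_ns) * 1000


if __name__ == "__main__":
    print(port_throughput_mbps(4, 100))    # 4-bit port at 100 MHz
    print(port_throughput_mbps(8, 100))    # 8-bit port at 100 MHz
    print(loop_throughput_mbps(64, 4))     # two INs of 32 bits in four dual-issue slots
```

Both port configurations therefore fit under the roughly 1000 Mbit/s that the six-instruction loop can sustain, with the 8-bit case leaving little headroom.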

For the host interface, I would use either a Gbit Ethernet or a USB interface to the host; that enables you to get rid of your data with decent bandwidth.

A bigger question is whether you could implement a trigger; signal X high, or X high whilst Y is low; and on the trigger you ship the samples starting from, say, 64 kbyte before the trigger up to, say, 64 kbyte after the trigger. For a simple trigger (signal X high or low) this must be possible. For more complex triggers you may be able to use all left over threads and run them in parallel over the data. Each would be looking at a subsequent pair of words.
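
The pre/post-trigger capture described above amounts to a ring buffer of recent samples that is frozen when the trigger fires. A minimal Python sketch of that idea follows, with sample counts instead of kbytes and a caller-supplied trigger predicate; `capture_around_trigger` is a hypothetical name, and real firmware would do this on buffers in shared memory.

```python
from collections import deque


def capture_around_trigger(samples, trigger, pre, post):
    """Return up to `pre` samples before the first triggering sample,
    the triggering sample itself, and up to `post` samples after it.
    Returns an empty list if the trigger never fires."""
    history = deque(maxlen=pre)          # ring buffer of the most recent samples
    it = iter(samples)
    for s in it:
        if trigger(s):
            captured = list(history) + [s]
            for _, nxt in zip(range(post), it):
                captured.append(nxt)
            return captured
        history.append(s)
    return []


if __name__ == "__main__":
    # Trigger on "X high whilst Y is low", with X on bit 0 and Y on bit 1.
    x_high_y_low = lambda s: bool(s & 1) and not (s & 2)
    print(capture_around_trigger([0, 2, 3, 1, 0, 2], x_high_y_low, pre=1, post=1))
```

A simple level trigger is one comparison per sample; more complex triggers could be split across spare threads, each applying its own predicate to a different part of the buffer.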

Keep us posted - interesting project.

New User
Posts: 2
Joined: Tue Feb 21, 2017 5:10 pm

Post by »

Hello All,

Thanks a lot for the feedback.
I agree that moving to 1-bit ports is the way to go.
That was also my conclusion.
I also designed external logic to generate a clock pulse every time the bus clock or bus data has a transition.
This will be my port clock and it should not be higher than 25 MHz.
I also noticed in the XMOS data sheet that the data needs to be stable at least 8 ns after port clock edge.
My bus has a hold time of minimum 3 ns which can be troublesome in some cases.
This forced me to add extra delays on the sampled clock and data line (about 15 ns to be on the safe side).
I was planning to use the deserializer of the port with INT32 in order to lower the rate at which the thread needs to process data.
If my port clock is 25 MHz, it gives me a processing rate of about 781 kHz (25 MHz / 32), which is a lot more doable (and I can even share the processing across multiple threads, as Data suggested).

Now comes my next big worry...
The bus transmits data in bursts (a bit like I2C).
The deserializer of the port is clocked by bus activity.
If the bus transaction is over, the bus is idle and the clock / data lines are steady.
The amount of data transmitted on the bus can be any number of bits (say 43 bits, for instance).
What happens to the deserializer?
Will it get stuck and freeze the thread that is using it or is there a possibility to have a timeout to capture whatever has been shifted in the port shift register?

To summarize:
1. Oversampling allows me to use the deserializer without any trouble, as there is always a clock to run the port, but it generates a lot of data to be processed and transmitted over USB. Difficult but likely doable.
2. A bus-based sample clock lowers the amount of data and processing but may cause difficulties with the port deserializer. If the deserializer is not usable, the processing rate of 25 MHz makes it difficult to handle: I have to shift each bit read into an INT32 and, when the INT32 is full, move it to a buffer, all of this within 40 ns...
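
To make the partial-word problem concrete, here is a small Python model of packing a burst into 32-bit words and flushing whatever is left when the burst ends. It says nothing about how the XMOS port behaves; it only shows the bookkeeping that is needed whichever mechanism detects the end of the burst. `pack_bits` is a hypothetical name, and LSB-first packing is an assumption.

```python
def pack_bits(bits, word_bits=32):
    """Pack a burst of bits into (value, valid_bit_count) words, LSB first.

    The final word may be partial; its count says how many bits are valid.
    """
    words = []
    value = 0
    count = 0
    for b in bits:
        value |= (b & 1) << count
        count += 1
        if count == word_bits:
            words.append((value, count))
            value, count = 0, 0
    if count:                      # burst ended mid-word: flush what was collected
        words.append((value, count))
    return words


if __name__ == "__main__":
    # A 43-bit burst gives one full word and one 11-bit partial word.
    print(pack_bits([1] * 43))
```

Storing the valid-bit count alongside each word lets the PC application tell a short final word from a full one.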

If there is a workaround for option 2 that ensures I do not get stuck accessing the port, I will take option 2.

Thanks for your advice,

Posts: 1
Joined: Sun Apr 09, 2017 11:18 am

Post by larytet »

I wonder if a single-device solution is mandatory. Is there a cost/space constraint?