Response time to external events!

gothed
Junior Member
Posts: 7
Joined: Wed Nov 06, 2013 1:50 am

Post by gothed »

Hi everybody,

I am interested in comparing the XMOS response time to external events against more traditional cores (read: AVR). The setup is very simple:

An external microcontroller drives a pin on the XMOS, named RX, high, and the XMOS responds by driving a pin named TX high. A logic analyzer sampling at 100 MHz is connected to both the TX and RX pins. I have found that the XMOS typically takes 100 ns to respond to the external event (equal to two I/O CLK cycles). On occasion it takes 240 ns to respond (5 I/O CLK cycles?).

I would like to know from everybody here if the code I have written results in a fair representation of the XMOS response time, or if there are better/more efficient ways to do this.

Code:

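#include <xs1.h>

in  port RX = XS1_PORT_1E;  // 1-bit input driven by the external microcontroller
out port TX = XS1_PORT_1F;  // 1-bit output watched by the logic analyzer

int main (void) {
    int val = 1;
    while (1) {
        RX when pinseq(val) :> void;  // block until RX reads 1
        TX <: 1;                      // respond: drive TX high
        TX <: 0;                      // then drive it low again
    }
    return 0;
}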
Please let me know what you think: are these numbers a fair representation of XMOS?

Thanks - Dominik


segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Hi there,

What do you call "I/O CLK"? The clock your ports use runs
at 100MHz, that is 10ns per tick.
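
To make the tick arithmetic concrete, here is a minimal Python sketch that converts the latencies reported in the first post into 100 MHz port-clock ticks. The function names are made up for illustration; only the 100 MHz clock and the 100 ns / 240 ns figures come from this thread.

```python
# Convert measured latencies into port-clock ticks.
PORT_CLOCK_HZ = 100_000_000  # port clock quoted in this thread (100 MHz)


def tick_period_ns(clock_hz):
    """Length of one clock tick in nanoseconds."""
    return 1e9 / clock_hz


def latency_in_ticks(latency_ns, clock_hz=PORT_CLOCK_HZ):
    """Express a latency in ticks of the given clock (hypothetical helper)."""
    return latency_ns / tick_period_ns(clock_hz)


if __name__ == "__main__":
    for measured_ns in (100, 240):  # figures from the first post
        print(measured_ns, "ns =", latency_in_ticks(measured_ns), "ticks")
```

With a 10 ns tick, 100 ns corresponds to 10 ticks and 240 ns to 24 ticks, far more than the "two" and "five" cycles assumed in the first post. A logic analyzer sampling at 100 MHz also resolves edges only to within one such 10 ns tick.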

Your XC code looks fine, but your numbers are way off.
Did you compile with optimisation disabled, perhaps?

If you look at the generated code (with xobjdump for example)
you should see an IN instruction immediately followed by
two OUT instructions.

Post by gothed »

You are right, I remembered incorrectly: the I/O CLK does run at 100 MHz.

It looks like I never flashed the device (oops). I have now managed to actually flash the device and not run it in Debug mode anymore. Response times are now down to 50 ns. Is that what we would expect?

The main function looks like this:

Code:

           0x100f4 	mkmsk (rus) r0, 0x1
           0x100f6 	ldw (lru6) r1, cp[0x1]
           0x100fa 	setd (r2r) res[r1], r0
           0x100fc 	setc (ru6) res[r1], 0x11
           0x100fe 	ldw (lru6) r2, cp[0x0]
           0x10102 	ldc (ru6) r3, 0x0
           0x10104 	in (2r) r11, res[r1]
           0x10106 	out (r2r) res[r2], r0
           0x10108 	out (r2r) res[r2], r3
           0x1010a 	bu (u6) -0x4 <0x10104>
Thanks - Dominik

Post by segher »

It doesn't matter if you boot from flash or start your program
using JTAG; the same code runs, in the same way.

Your generated code looks fine, you have optimisation
enabled.

Your code only does what you expect the first run through
(you don't wait for RX to go low again), maybe that is what
is throwing your measurements off?
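
A rough way to see this effect is to model the loop in Python. This is only a sketch under strong assumptions: each list element stands for the RX level seen at one pass through the wait, real instruction and port timing are not modelled, and both function names are hypothetical. It relies on the fact that `pinseq` is satisfied by the pin level, not by an edge.

```python
def count_responses_fixed(rx_samples):
    """Model of the posted loop: always wait for RX == 1, then pulse TX."""
    pulses = 0
    for level in rx_samples:
        if level == 1:  # the wait completes whenever RX is high, edge or not
            pulses += 1
    return pulses


def count_responses_toggling(rx_samples):
    """Model of a loop that waits for high, then low, then high again."""
    responses = 0
    awaited = 1
    for level in rx_samples:
        if level == awaited:
            responses += 1
            awaited = 1 - awaited  # next time wait for the opposite level
    return responses


if __name__ == "__main__":
    rx = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]  # made-up RX trace with three edges
    print(count_responses_fixed(rx), count_responses_toggling(rx))
```

On the made-up trace, the fixed-level model responds on every pass while RX is high, whereas the toggling model responds once per edge, which is the behaviour a latency measurement needs.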

Post by gothed »

EDIT: Ignore this post. There were grounding problems, they are now solved. Please refer to the next post.

Something here is just not right, could you please explain why the following happens:

Code:

#include <xs1.h>

in  port RX = XS1_PORT_1E;
out port TX = XS1_PORT_1F;

int main (void) {
    TX <: 1;  // drive TX high once; it is never written again
    unsigned val = 1;
    while (1) {
        RX when pinseq(val) :> void;  // wait for RX to read 1; no output follows
    }
    return 0;
}
Result: See attached file

Curiosity: the above code should never change TX after the initial write. However, every time the value on RX changes, TX seems to drop low for 10 ns. What is going on here?

Thanks - Dominik
Attachments
Strange2.png
Strange2.png (1.87 KiB) Viewed 4650 times
StrangeRising.png
StrangeRising.png (4.1 KiB) Viewed 4650 times

Post by gothed »

Thanks to segher, I now feel comfortable that the code below fairly characterizes the performance of the XCORE architecture.

Code:

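#include <xs1.h>

out port TX = XS1_PORT_1E;
in  port RX = XS1_PORT_1F;

int main (void) {
    TX <: 1;
    unsigned val = 1;
    while (1) {
        RX when pinseq(val) :> void;  // wait until RX equals the expected level
        TX <: val;                    // mirror that level on TX
        val = !val;                   // next time, wait for the opposite level
    }
    return 0;
}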
The above code produces a typical response time of 50 ns. Unless someone calls foul on this, I would consider it a fair measurement of the XMOS architecture.

There is a published paper named "Benchmarking I/O response speed of microprocessors" by Goncalo Martins, Andrew Stanford-Jason, and David Lacey. To quote from that paper:

"There are two approaches to this: run every device under test at the same clock frequency or normalize by dividing the result by the clock frequency."

OK, the idea is simple: we want to compare architectures, not clock speeds. It so happens that I conducted this same test on an 8-bit AVR using an interrupt implementation (interrupts have obvious advantages over polling; specifically, I can respond to several external events without needing multiple cores).

Preliminary results show that the AVR takes 12 clock cycles to respond to the external event. At 20 MHz that is 600 ns. Scaled to the frequency at which the XCORE ports are clocked (100 MHz), this would be 120 ns vs. the 50 ns of the XMOS. If, however, we scale it to the XMOS core frequency (400 MHz), it is 30 ns on the AVR vs. the 50 ns on the XMOS.
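
The scaling above is just cycles divided by clock frequency. A small Python sketch of that arithmetic, using only the figures quoted in this post (the function name is made up for illustration):

```python
def response_time_ns(cycles, clock_hz):
    """Time taken by `cycles` clock cycles at `clock_hz`, in nanoseconds."""
    return cycles * 1e9 / clock_hz


if __name__ == "__main__":
    AVR_CYCLES = 12  # measured AVR interrupt response, from this post
    for clock_hz in (20_000_000, 100_000_000, 400_000_000):
        print(clock_hz // 1_000_000, "MHz:",
              response_time_ns(AVR_CYCLES, clock_hz), "ns")
```

Note that this scaling assumes the cycle count stays the same at every clock frequency, which is the normalization the quoted paper describes, not a claim that such an AVR exists.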

Please tell me if I have made any mistake, as I want this to be a fair and honest discussion/comparison of XMOS capabilities vs. another core architecture. After reading the above cited paper, which indicates performance increases of the order of a factor of one thousand, I was shocked to find that the old AVR architecture, once scaled to frequency, would actually beat the XMOS core.

What are everybody's thoughts?

Thanks - Dominik
Attachments
50nS.png
(26.69 KiB) Not downloaded yet

Post by segher »

You should be able to get down to 20ns. Your board might
have too long a rise time; you can't tell from LA output.

Your processor probably runs at 500MHz, btw.

You are measuring only one aspect of I/O performance:
the latency of reacting from one signal to one other signal,
programmed in one particular way. There are many other
important dimensions. In the end, all that matters is if your
processor is fast enough for your application or not ;-)

Good luck finding a 500MHz AVR btw. For counting clock
cycles it is better to count thread cycles (at least 8ns; a
thread is scheduled at most every fourth processor cycle);
and the absolute latency is a more interesting number in
the end.
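
As a sketch of the thread-cycle arithmetic, assuming the 500 MHz core clock guessed above and a thread issuing at most every fourth processor cycle (the function and parameter names are hypothetical):

```python
def thread_cycle_ns(core_clock_hz, issue_interval=4):
    """Shortest time between two instructions of one thread, in ns."""
    return issue_interval * 1e9 / core_clock_hz


def latency_in_thread_cycles(latency_ns, core_clock_hz, issue_interval=4):
    """Express an absolute latency in thread cycles."""
    return latency_ns / thread_cycle_ns(core_clock_hz, issue_interval)


if __name__ == "__main__":
    CORE_CLOCK_HZ = 500_000_000  # assumed, per the guess in this post
    print(thread_cycle_ns(CORE_CLOCK_HZ), "ns per thread cycle")
    for latency_ns in (50, 20):  # measured value and the suggested target
        print(latency_ns, "ns =",
              latency_in_thread_cycles(latency_ns, CORE_CLOCK_HZ),
              "thread cycles")
```

Under these assumptions, the measured 50 ns is about six thread cycles and the 20 ns target is between two and three.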

Cheers, Segher
davelacey
Experienced Member
Posts: 104
Joined: Fri Dec 11, 2009 8:29 pm

Post by davelacey »

gothed wrote: Preliminary results show that the AVR takes 12 clock cycles to respond to the external event. At 20 MHz that is 600 ns. Scaled to the frequency at which the XCORE ports are clocked (100 MHz), this would be 120 ns vs. the 50 ns of the XMOS. If, however, we scale it to the XMOS core frequency (400 MHz), it is 30 ns on the AVR vs. the 50 ns on the XMOS.

Please tell me if I have made any mistake, as I want this to be a fair and honest discussion/comparison of XMOS capabilities vs. another core architecture. After reading the above cited paper, which indicates performance increases of the order of a factor of one thousand, I was shocked to find that the old AVR architecture, once scaled to frequency, would actually beat the XMOS core.

What are everybody's thoughts?

Thanks - Dominik
I don't think this is too surprising, so your experiments are probably about right. The best case of taking an interrupt and then outputting straight away is going to be a few instructions on most architectures. If you scale them to frequency, the performance is going to be roughly the same on AVR, ARM, XMOS etc. (though I expect it would be a bit of a struggle to scale an AVR up to 500 MHz as it currently is, i.e. with no superscalar execution etc., but I'm not a silicon engineer). You can see that in the paper, since the difference in the best cases for one event between the three systems is not huge. There are other things to take into account, in particular:

* What about saving and restoring the context (registers etc.) of the task you are interrupting?
* What happens when you have to react to multiple events?

These are the things that start to make a difference (and are typical on most real-time embedded systems). The first is particularly troublesome if you have a contended memory bus. The second tends to scale linearly on most architectures but is constant up to the number of logical cores you have for XMOS. Once you start adding these things you get more variation/non-determinism in timing (which is why we repeatedly ran the experiments and got the timing distribution).
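
The scaling difference can be sketched with a deliberately crude Python model. Everything here is an assumption for illustration: simultaneous events on a single interrupt-driven core are handled strictly one after another, each event on the multi-core side has its own logical core, and context save/restore and scheduling effects are ignored. The function names and the core count of 8 are hypothetical parameters; the 600 ns and 50 ns figures are the ones quoted earlier in this thread.

```python
def sequential_worst_case_ns(n_events, per_event_ns):
    """Linear model: one handler services simultaneous events in turn."""
    return n_events * per_event_ns


def per_core_worst_case_ns(n_events, per_event_ns, logical_cores):
    """Constant model: valid only while every event has its own logical core."""
    if n_events > logical_cores:
        raise ValueError("model only covers n_events <= logical_cores")
    return per_event_ns


if __name__ == "__main__":
    LOGICAL_CORES = 8  # assumed core count, not taken from this thread
    for n in (1, 2, 4, 8):
        print(n, "events:",
              sequential_worst_case_ns(n, 600), "ns sequential vs",
              per_core_worst_case_ns(n, 50, LOGICAL_CORES), "ns per-core")
```

The model deliberately refuses to answer beyond the core count, since nothing in this thread says what the latency looks like once events have to share a logical core.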

By the way, if you are interested you can get the code we used here:

https://github.com/xmosopen/sw_io_benchmarks

You quote the AVR as 12 clock cycles. Are you measuring the time externally as well (e.g. with a scope)? We found that just counting instructions can be quite misleading if you are not careful.

Dave