500 MIPS event-driven RISC processor + 100% deterministic

jason
XCore Expert
Posts: 577
Joined: Tue Sep 08, 2009 5:15 pm

500 MIPS event-driven RISC processor + 100% deterministic

Postby jason » Wed May 26, 2010 3:19 pm

This report details why XMOS is a serious candidate to replace low cost FPGAs.

FPGAs and CPLDs are used in many industries covering a broad range of performance requirements, price points and power envelopes. In the early days, FPGAs were used for prototyping ASICs and for high-end, low volume applications that could bear a high unit cost, such as the communications and defence sectors. Since then, FPGA vendors have driven down costs and power, through rapid process migration, to produce new lower cost and lower power device families to address new requirements.

Now, for the first time, there is an all-digital, flexible solution that will prove to be better, cheaper, easier and lower power than an FPGA for many applications: XMOS.

Read the report to learn more: http://bit.ly/fpgavsxmos

PS: check out this thread and join the discussion about XMOS vs FPGAs here
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Postby Heater » Thu May 27, 2010 5:23 am

From the paper:

"a 500 MIPS event-driven RISC processor with 100% deterministic operation"

I always think this statement about XMOS chips needs a little qualification, because as far as I can tell it is not exactly true as stated.

1) If I write some code intended to run as a thread on an xcore, then its speed of execution will depend on how many threads that xcore is running. So if you want to use my object in your project, its performance depends on what else is going on in your code. So the timing of my object is not deterministic.

2) More insidious is that MUL and DIV take more time than other instructions and will stall other threads while they execute. So the execution flow of the object I provide for you now suffers random jitter from the other code running in your project's threads. So the timing of my object is not deterministic.

To achieve timing determinism one cannot simply rely on cycle counting the execution of code in the processor; one must make use of the timing facilities available in the xcore. Presumably, in tight cases even that can break if a code path in my object is subject to stalls from MUL or DIV executing in the application.

Alternatively, determinism is achieved by insisting that the object I supply lives alone on its own xcore. That is rather a lot to ask for some small functions.

Bottom line: the timing of threads on an xcore is not quite as deterministic as having those threads run on independent processors, as you might in some other architectures.

David May and I had a little chat about this on this forum over the New Year break, and I think he basically agreed. I can't find the thread now.

Makes me wonder how the timing analysis tool handles this. I have yet to try it out.
snowman
Member
Posts: 13
Joined: Fri Dec 11, 2009 10:51 am

Postby snowman » Thu May 27, 2010 9:52 am

"More insidious is that MUL and DIV take more time than other instructions and will stall other threads while they execute. So the execution flow of the object I provide for you now suffers random jitter from the other code running in your project's threads. So the timing of my object is not deterministic."

I should correct this slightly:

1) MUL does not take more time

2) DIV does not affect other threads

Try the functional simulator for example:

$ cat a.xc
#define DIV asm("divs r11, r11, r11")
#define MUL asm("mul r11, r11, r11")
int main()
{
    par {
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; DIV; }
        { MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; MUL; }
    }
    return 0;
}

$ xcc -target=XK-1 a.xc && xsim -t a.xe | grep mul | grep -o '@[0-9]\+$'
@1576
@1580
@1584
@1588
@1592
@1596
@1600
@1604
@1608
@1612
@1616
@1620
@1624
@1628
@1632
@1636

In the functional simulator, each MUL instruction takes 10 ns to execute (4 x 2.5 ns), despite seven other threads executing DIV instructions.
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Postby Heater » Thu May 27, 2010 10:30 am

Interesting.

That is somewhat contrary to the documentation and to David's statements during our discussion about this a while back.

I'll have to dig out my references now, I guess.
User avatar
lilltroll
XCore Expert
Posts: 955
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Postby lilltroll » Thu May 27, 2010 6:39 pm

Maybe you mean the long divide instruction, which also computes the remainder so you can implement division with arbitrary precision. As I understood it, the long divide is shared between all threads and may also need several instructions to calculate the result!?
Probably not the most confused programmer anymore on the XCORE forum.
richard
Respected Member
Posts: 318
Joined: Tue Dec 15, 2009 12:46 am

Postby richard » Thu May 27, 2010 8:10 pm

There is one divide unit per core, shared between all threads on that core. The following instructions use the divide unit:
  • divu
  • divs
  • remu
  • rems
  • ldivu
If multiple threads simultaneously execute instructions from this list, then one thread will acquire the divide unit and the others wanting to use it must wait until it becomes free.

These are the only instructions where there is inter-thread interference. The macc instruction executes in a single thread cycle regardless of what the other threads are doing.
Last edited by richard on Thu May 27, 2010 10:27 pm, edited 2 times in total.
richard
Respected Member
Posts: 318
Joined: Tue Dec 15, 2009 12:46 am

Postby richard » Thu May 27, 2010 9:21 pm

Heater wrote: 1) If I write some code intended to run as a thread on an xcore, then its speed of execution will depend on how many threads that xcore is running. So if you want to use my object in your project, its performance depends on what else is going on in your code. So the timing of my object is not deterministic.
Heater wrote: To achieve timing determinism one cannot simply rely on cycle counting the execution of code in the processor; one must make use of the timing facilities available in the xcore. Presumably, in tight cases even that can break if a code path in my object is subject to stalls from MUL or DIV executing in the application.
For a given program you can calculate the maximum number of threads allocated at one time (the tools will do this for you). With this information it is very easy to reason about the worst-case timing without knowing what the other threads are executing at the time. You can do this by assuming each divide takes the maximum possible time (32 thread cycles) and by assuming every other allocated thread is always running.

If the code uses the facilities the XCore provides (clock blocks / timers) to time I/O, then the worst-case timing information should be enough to ensure the code always meets its timing requirements. If the code runs faster, it will still behave correctly, as the I/O instructions pause until the I/O needs to take place.

I agree that if you want to time I/O by counting instruction cycles then worst-case timing information is insufficient. To me, counting instruction cycles seems a fragile programming style, as every time you make a change you must carefully rearrange/add/remove instructions so things fall on the correct cycle. That said, I believe there is a way to write code this way on an XCore if you truly desire. When a thread is placed in fast mode it reserves a pipeline slot even when paused. If all threads are placed in fast mode, the execution speed of each thread is fixed regardless of whether other threads are paused. Note that threads in fast mode may burn more power, and other threads will no longer speed up when a fast-mode thread is paused (which might be bad if you have any threads where average throughput is important).
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Postby Heater » Fri May 28, 2010 5:36 am

richard: "For a given program you can calculate the maximum...."

Herein lies my issue: there is no given program. Let's see if I can express it clearly.

Let's say I write a device driver for some gadget. I want it to run as fast as possible. It might have to be fast to satisfy the gadget's hardware requirements, and/or it might have very strict timing requirements. Let's say it uses just one or a few threads. I want to post my code as a project on the xcore exchange for anyone to use. I want them to be able to just drop it into whatever "program" they have as a reusable object. That is, it has to coexist with a bunch of other objects, unknown to me, running on other threads. The user of my object should only have to know about the channel interface I specify for my gadget driver. The user absolutely should not have to get into doing timing analysis of my code.

How can I be sure my gadget driver will work for everyone in all possible programs?

1) I have to assume the slowest processor is used (400 MHz).
2) I have to assume all threads on the core will be in use all the time.
3) I have to assume those other threads are full of div, rem, ldiv etc.
4) I probably have to assume the worst case optimization level set by the user when compiling.

Now, I think I agree that timing programs by cycle counting is probably fragile and unmaintainable. If nothing else, it is a pain to create in the first place unless you do it in assembler on a machine with a very simple/regular instruction set, preferably with all instructions taking the same number of clocks.

However, even when using all the hardware timing facilities available in the xcore, one has to make the assumptions listed above to arrive at the maximum performance that will work when that code is thrown in with a bunch of other objects about which one knows nothing and which may not even exist yet.

This might seem like nit-picking, but if there is to be a community of xcore object creators putting up objects for general use, they have to take this into account.

Bottom line is that the timing of my object is not 100% deterministic, at least not determinable by me. It may be so in the complete program but I will probably never see that.

P.S. What is "fast mode"? I seem to have bypassed that whilst wading through the documentation.
davelacey
Experienced Member
Posts: 104
Joined: Fri Dec 11, 2009 8:29 pm

Postby davelacey » Fri May 28, 2010 8:15 am

Generally, given those assumptions (I have comments on them later), if timing is OK in one program it should be OK for all. However, a reasonable approach is to distribute a timing script with the component. This method gives a pass/fail when the user builds their final application. So the user is doing the timing, but it requires no user know-how; it is more of a sanity/safety check.

Regarding your assumptions. I think they have to be taken into consideration but are not too restrictive:

1) I have to assume the slowest processor is used (400 MHz).

Yes (unless documented otherwise).

2) I have to assume all threads on the core will be in use all the time.

Yes (unless documented otherwise).

3) I have to assume those other threads are full of div, rem, ldiv etc.

Yes. This is the most annoying one if your component needs div in real-time-dependent code (experience suggests this is pretty rare, though; AFAIK none of our current components/examples need it).

4) I probably have to assume the worst case optimization level set by the user when compiling.

This is going to be too restrictive. For binary distributed components the optimization level is pre-set.
I think source code components have to stipulate their own optimization level.
If you look at the example code on xmos.com, it uses a set of makefiles where each component can override the compiler flags for its own source, separate from the compile flags of the application using it.
leon_heller
XCore Expert
Posts: 546
Joined: Thu Dec 10, 2009 10:41 pm
Location: St. Leonards-on-Sea, E. Sussex, UK.

Postby leon_heller » Fri May 28, 2010 8:42 am

"Fast mode" is described in xs1_en.pdf.
