Thread priority - is there any option?

Technical questions regarding the XTC tools and programming with XMOS.
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

@paul,

Thank you, looks like I have a lot of fun experimenting to do.

@daveg,

I think the problem comes when you have one thread that needs maximum speed in order to keep up with some external hardware. Splitting that over multiple threads may not be possible.

So now if you need 4 other, not so demanding, threads you have a problem: adding the fifth thread breaks the timing of the critical one.
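To put rough numbers on that (assuming a 400 MHz XS1 core, where an active thread gets at most one instruction issue per scheduler round):

Code: Select all

up to 4 active threads: 400/4 = 100 MIPS each, guaranteed
5 active threads:       400/5 =  80 MIPS each (the critical thread just lost 20%)
8 active threads:       400/8 =  50 MIPS each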


mculibrk
Active Member
Posts: 38
Joined: Tue Jul 13, 2010 2:57 pm

Post by mculibrk »

This was exactly the reason I asked about "thread priority" - it would be very nice to have some option to set a thread as "high priority", telling the scheduler not to "steal" its cycles for other threads.
You could set at most one (well, up to 3 at the other extreme) such "high priority" thread, and the remaining threads would share cycles in the same simple round-robin scheme.

I know... 100 people, 1001 wishes/ideas... :?

regards,
mculibrk
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Do you have a real-life example where this is actually useful?
mculibrk
Active Member
Posts: 38
Joined: Tue Jul 13, 2010 2:57 pm

Post by mculibrk »

Well... right now, in the project I'm working on, at least one thread (but there will be two) needs all its "mips" available all the time (it's a kind of display driver where timing matters - especially if "delays" accumulate over time).
(an 8-12 channel soft PWM LED driver)
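To give a feel for it, the inner loop of such a thread looks roughly like this sketch (the port name, channel count and timing constants are all made up):

Code: Select all

#include <xs1.h>

#define NCH 8               // PWM channels driven by this thread
#define STEPS 256           // brightness resolution
#define TICKS_PER_STEP 100  // 1 us per step at the 100 MHz reference clock

out port leds = XS1_PORT_8A;  // hypothetical 8-bit port, one LED per bit

void soft_pwm(out port p, const unsigned duty[NCH]) {
    timer tmr;
    unsigned t, phase = 0;
    tmr :> t;
    while (1) {
        unsigned bits = 0;
        for (int i = 0; i < NCH; i++)
            if (phase < duty[i])         // duty[i] in 0..STEPS
                bits |= 1 << i;
        p <: bits;                       // update all channels at once
        phase = (phase + 1) % STEPS;
        t += TICKS_PER_STEP;             // if the loop cannot finish within one
        tmr when timerafter(t) :> void;  // step, the delays accumulate over time
    }
}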

A few other threads are used for fetching/preparing data for these "full-speed" threads, plus "housekeeping" and other, additional, low-priority tasks.

Don't get me wrong, I know it all depends on how one actually codes things, but given the "use a thread for anything" XMOS paradigm, the threads run out quite fast, and once you hit the 5th thread you can kiss the deterministic behavior of the thread scheduler goodbye.

There are always different approaches and solutions, no doubt, and it's almost always possible to find a solution - if nothing else, in the form of a bigger hammer. :D
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

mculibrk,

As you may have read it was your question re: thread priorities that prompted me to post a question about a scheduler for the XMOS.

I have worked on a few embedded systems in the past where the model of programming looked like this:

1) One or a few things that need to be done ASAP. Generally handled by high priority interrupts.

2) The rest of the app that may have multiple threads, basically not so time critical, just cruising around in the background.

3) On occasion we have had to go to two or more processors to handle the high speed parts because interrupts trip over themselves.

Mapping this onto the XMOS we get:

1) 1 to 3 hardware threads running the high speed real time stuff, basically taking the role of interrupts.

2) 1 hardware thread handling all the slower stuff.

3) Need more high speed stuff? Use a 4 core chip?

Now I see that:

a) More than 4 threads is out as we lose speed for 1) above.

b) In 2) we need a scheduler in the 4th thread to coordinate the lower priority tasks we have designed for the system.

c) We have 3 more cores, life is good!

What might be cool is if the PAR statement did not always eat a hardware thread, but could instead spin up another software thread on an existing hardware thread. Then all threading code would look the same. Threads could be moved from core to core, and from hardware thread to hardware thread, as required.
mculibrk
Active Member
Posts: 38
Joined: Tue Jul 13, 2010 2:57 pm

Post by mculibrk »

Heater, I agree with you completely.
Heater wrote:What might be cool is if the PAR statement did not always eat a hardware thread, but could instead spin up another software thread on an existing hardware thread. Then all threading code would look the same. Threads could be moved from core to core, and from hardware thread to hardware thread, as required.
That would actually be very nice. Maybe the simplest way of doing that would be just to implement the "thread priority" I mentioned earlier. Right, that's "easiest/simplest for me 8-)", but it requires some changes in the XCore, which is not so "simple to expect". This way the code clarity and all the rest stay the same. It would be great as long as you need no more than the 8 threads supported by hardware.

If you need more, well, "take more cores" :D
or use a soft-scheduler.

Anyway, using just the plain par would be somewhat problematic - you still need to tell the scheduler at which priority to execute the thread (or whether to use a hw or sw thread)...

For the time being (until some XCore v2.0 pops out) I think a soft scheduler is the way to go...
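For what it's worth, the crudest soft scheduler needs no special support at all - just one hardware thread making a fixed round-robin of calls, where each (hypothetical) task_*_step() does a bounded slice of work and returns quickly. A minimal sketch:

Code: Select all

// the 4th hardware thread runs all the low-priority work cooperatively
void soft_scheduler(chanend c_data, chanend c_house) {
    while (1) {
        prepare_data_step(c_data);   // feed the full-speed threads
        housekeeping_step(c_house);  // slow background jobs
        logging_step();              // every call must return promptly,
    }                                // or all the other tasks stall
}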

but I'm still curious how a function could be defined to take a "function reference" as a parameter... :?
(to be able to call the thread start from XC)

regards,
mculibrk
Heater
Respected Member
Posts: 296
Joined: Thu Dec 10, 2009 10:33 pm

Post by Heater »

I'm no silicon designer, but I imagine "thread priority" in hardware is a lot more complicated.

As it stands now the XCore shovels code into a pipeline and eventually results come out the end.
It can start a new instruction for a given thread at most every four cycles, and the pipe is four stages long, so that's your maximum execution rate.
Add a fifth thread and the round takes another clock to get around.
A priority scheme would mess this process up a lot, and I imagine it's complex enough already.
Anyway, using just the plain par would be somewhat problematic
Yes, some syntax would be needed to tell PAR whether to create hard or soft threads. Then of course you need soft channels to go with those soft threads.

Anyway, this is what was done in OCCAM. One had to use SEQ to specify sequential code and PAR to indicate statements that ran in parallel. This was all soft threads on a single chip. Then there was PLACED PAR, which put parallel statements onto different chips. One could move threads around from chip to chip by changing the placement.


As for function pointers, I guess you have to do your stuff in C and call some XC funcs to handle the threading and comms work.
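Something like this on the C side, for instance (a sketch - every name here is hypothetical):

Code: Select all

/* task_table.c - XC cannot pass function pointers, so keep them
   on the C side and let XC pick a task by a plain integer index. */
typedef void (*task_fn)(void);

static void task_blink(void) { /* ... */ }
static void task_uart(void)  { /* ... */ }

static task_fn task_table[] = { task_blink, task_uart };

/* callable from XC - the prototype only mentions an int */
void run_task(int idx) {
    task_table[idx]();
}

From XC you would just declare void run_task(int idx); and call it with the index of the task you want.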
- you still need to tell the scheduler at which priority to execute the thread (or whether to use a hw or sw thread)
As far as I remember the OCCAM PAR had no concept of priority, apart from PLACED, which could move a process to its own chip.
mculibrk
Active Member
Posts: 38
Joined: Tue Jul 13, 2010 2:57 pm

Post by mculibrk »

Heater wrote:I'm no silicon designer, but I imagine "thread priority" in hardware is a lot more complicated.

As it stands now the XCore shovels code into a pipeline and eventually results come out the end.
It can start a new instruction for a given thread at most every four cycles, and the pipe is four stages long, so that's your maximum execution rate.
Add a fifth thread and the round takes another clock to get around.
A priority scheme would mess this process up a lot, and I imagine it's complex enough already.
I'm no silicon designer either, but I doubt it's really that complicated. Given the available info, the core uses a "simple round-robin scheduler" which just skips threads in a "waiting" state, so it shouldn't be that difficult/complicated to add a "priority" check.
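Conceptually, something along these lines (pure C pseudocode, every name made up) - the hardware already skips waiting threads, and the priority check would just let a thread reserve its issue slot first:

Code: Select all

/* pick which thread issues in this pipeline slot: a runnable
   high-priority thread keeps its reserved slot, everyone else
   shares the remaining slots round robin */
int pick_thread(int slot, const int slot_owner[4],
                int nthreads, const int waiting[8],
                const int reserved[8], int *rr)
{
    int owner = slot_owner[slot];
    if (owner >= 0 && !waiting[owner])
        return owner;                    /* high-prio thread wins its slot */
    for (int step = 0; step < nthreads; step++) {
        int t = (*rr + step) % nthreads; /* the existing skip-the-waiting scan */
        if (!reserved[t] && !waiting[t]) {
            *rr = (t + 1) % nthreads;
            return t;
        }
    }
    return -1;                           /* nothing runnable in this slot */
}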
Anyway, using just the plain par would be somewhat problematic
Yes, some syntax would be needed to tell PAR whether to create hard or soft threads. Then of course you need soft channels to go with those soft threads.
Yep, resources, and especially channels, would be the bigger issue to "implement" in a soft scheduler. The simplest way would be to just use "alternative functions" replacing normal channel usage, but then again we end up with (very) different code syntax/looks/usage.... :(
As for function pointers, I guess you have to do your stuff in C and call some XC funcs to handle the threading and comms work.
Yes, but it's sort of ugly again... and I think it would not be possible to call such "thread starting functions" from XC, as I think there is no way to pass function pointers in XC...
- you still need to tell the scheduler at which priority to execute the thread (or whether to use a hw or sw thread)
As far as I remember the OCCAM PAR had no concept of priority, apart from PLACED, which could move a process to its own chip.
This whole "priority thing" is only related to XCore thread scheduling when you have more than 4 threads active. I suppose in OCCAM each parallel task on a single core actually shared the CPU's resources, so it would be better to compare a single XCore with 4 "normal" CPUs under OCCAM. But I may be wrong, badly wrong, as I know nothing about OCCAM, how it works, or where it runs....
lilltroll
XCore Expert
Posts: 956
Joined: Fri Dec 11, 2009 3:53 am
Location: Sweden, Eskilstuna

Post by lilltroll »

An example, just using XC timers and channels to control the scheduling:

On one core I run 4 FIR filters that transfer the in/out data via the channel array c[]. I also have a sleeping thread, get(). The length of each filter is 512 taps.

Code: Select all

par {
    fir_dd(c[0], s[0], xs[0], size);
    fir_dd(c[1], s[1], xs[1], size);
    fir_dd(c[2], p[0], xp[0], size);
    fir_dd(c[3], p[1], xp[1], size);
    get(c_vec, ptr);
}
I made a run in the simulator together with the waveform viewer.
When get() is sleeping, one period between each new input to the filter is 2702 cycles long.
Since the filter length was 512 taps, this translates to 5.28 cycles/FIR tap @ 512 taps.
The scheduler is thus counting: 0 1 2 3 0 1 2 3 0 1 2 3

After a while get() wakes up, since the Ethernet thread is requesting the current filter-tap values.
To limit the burden on the fir_dd threads, get() wakes up once every microsecond and sends one 32-bit value from the filter-coefficient array - i.e. it delivers 32 Mbit/s of data to the Ethernet thread.
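The get() thread itself is just a timer loop, roughly like this (the signature and names differ in my real code):

Code: Select all

void get(chanend c_vec, const int coeffs[], int ncoeffs) {
    timer tmr;
    unsigned t;
    tmr :> t;
    for (int i = 0; i < ncoeffs; i++) {
        t += 100;                        // 100 timer ticks = 1 us
        tmr when timerafter(t) :> void;  // while sleeping it costs no issue slots
        c_vec <: coeffs[i];              // one 32-bit word per us = 32 Mbit/s
    }
}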

The scheduler is now counting: 0 1 2 3 ........ 0 1 2 3 0 1 2 3 4..0 1 2 3 4 0 1 2 3 .......... 0 1 2 3
One period of a FIR filter now becomes 2786 cycles long, which translates to 5.44 cycles/FIR tap @ 512 taps.

The "slowdown factor" is less than 3% when get is awake and the solution is also very deterministic from the FIR filters perspective, thus I can give the get() a deterministic burden in such a way that I can guarantee the FIR filters to not be starved out of time.

PS. The FIR filter was done with FNOP reduction + counting backwards to 0 in the loops (avoiding the eq instruction) to achieve <6 cycles per tap DS.
Probably not the most confused programmer anymore on the XCORE forum.
segher
XCore Expert
Posts: 844
Joined: Sun Jul 11, 2010 1:31 am

Post by segher »

Hi lilltroll,
lilltroll wrote:PS. The FIR filter was done with FNOP reduction
What's that?
+ counting backwards to 0 in the loops (avoiding the eq instruction) to achieve <6 cycles per tap DS.
Another trick you can do is count forward, but offset so that you count from -N to 0 instead
of from 0 to N (and offset your pointers etc. the same way, of course).
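In C that shape looks roughly like this (a hypothetical FIR-ish inner loop):

Code: Select all

/* count up from -n to 0: the loop branch tests the counter
   against zero directly, so no separate compare is needed */
int acc = 0;
const int *c = coeff + n;   /* offset the base pointers so that   */
const int *d = x + n;       /* indices -n..-1 hit elements 0..n-1 */
for (int i = -n; i != 0; i++)
    acc += c[i] * d[i];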

And you might want to unroll the loop, of course.