Problem with parallel tasks execution

Post by psebastiani »

Dear all,
I have an xCore-200 Explorer board and I have a problem getting code to actually execute in parallel.
I have essentially two threads: one generates data over a streaming channel, and the other processes that data.
The data stream is 32 bits wide, and I want to process the upper 16 bits and the lower 16 bits separately and simultaneously, then rebuild the 32-bit word.
The output port TT is used to measure the actual execution time of the elabora() function externally with an oscilloscope.
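
For reference, the split/recombine step described above can be sketched in plain C (not XC). The helper names split_i, split_q and combine_iq are hypothetical and only illustrate the bit manipulation; they are not part of the posted code.

```c
#include <stdint.h>

/* Hypothetical helpers, written in portable C to illustrate the I/Q split. */

/* Upper 16 bits of the 32-bit sample (the I half). */
static inline uint16_t split_i(uint32_t sample) {
    return (uint16_t)(sample >> 16);
}

/* Lower 16 bits of the 32-bit sample (the Q half).
   The mask needs the bitwise &; a logical && would yield only 0 or 1. */
static inline uint16_t split_q(uint32_t sample) {
    return (uint16_t)(sample & 0xFFFFu);
}

/* Rebuild the 32-bit word from the low 16 bits of each processed half. */
static inline uint32_t combine_iq(uint32_t out_i, uint32_t out_q) {
    return ((out_i & 0xFFFFu) << 16) | (out_q & 0xFFFFu);
}
```

Splitting a word and recombining the unmodified halves should give back the original word, which is a quick sanity check before any processing is inserted between the two steps.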

Code: Select all

//Produce test thread
void ProduceThread(streaming chanend c_out){
    int i;
    for (i=0; i<10; i++) {
        c_out <: (i + (i<<16));
        Wait_us(5);
    }
}

//Elabora test function
unsigned int elabora(unsigned int DataIn, unsigned int h[]){
    unsigned int i, DataOut;
    DataOut = 0;
    for (i=0; i<10; i++) {
        DataOut = DataOut + DataIn*h[i];
    }
    return DataOut;
}

//Elaborate test thread
void ElaborateThread(streaming chanend c_in, streaming chanend c_out, unsigned int Array[], out port TT){
    unsigned int I,Q, InData, OutI, OutQ;
    unsigned int i;
    unsigned int hQ[100], hI[100];
    for (i=0; i<100; i++) {
        hI[i] = Array[i];
        hQ[i] = Array[i];
    }

    for (i=0; i<10; i++) {
        c_in :> InData;
        //split data into I and Q
        I = InData >> 16;
        Q = InData & 0xFFFF;
        TT <: 1;
#if Q_ENABLE
        par {
            OutI = elabora(I, hI);
            OutQ = elabora(Q, hQ);
        }
#else
        OutI = elabora(Q, hI);
        OutQ = 0;
#endif
        TT <: 0;
        //build 32 bit data
        c_out <: ((OutI & 0xFFFF) << 16)  + (OutQ & 0xFFFF);
    }
}

int main(void) {
    streaming chan c, cOut32;
    
    on tile[0]: {
        ProduceThread(c);
    }
    
    on tile[0]: {     
        ElaborateThread(c, cOut32, TestArray, TT2);
    } 

    //Other thread with cOut32 channel input
}
The problem is with the par{} statement inside ElaborateThread().
With Q_ENABLE=0 the execution time of the elabora() function is about 850 ns.
With Q_ENABLE=1 the execution time is about 1500 ns.
With Q_ENABLE=1 it seems that the code inside the par{} statement is executed sequentially!
If I remove the par{} statement as below, I get the same result.

Code: Select all

...
//par {
    OutI = elabora(I, hI);
    OutQ = elabora(Q, hQ);
//}
...
Why is that? The elabora() calls inside the par{} statement work on different data (I intentionally duplicated "Array" into "hI" and "hQ") to avoid contention on memory access.
Also, when I compile the code with Q_ENABLE=0 the resource usage is 2 threads; with Q_ENABLE=1 it is 3 threads.
So the code is split into separate threads, but they are executed sequentially.
Any suggestions? I want to execute the threads in parallel.
Thanks


Post by mon2 »

Label the test function elabora1.

Duplicate this function and call it elabora2.

Then call each inside your par construct.

Please post your results.

Post by psebastiani »

Hi mon2,
Thank you for the suggestion, but it doesn't work. I get the same behavior: a thread is added, but the two functions are still executed "serially".
I forgot to mention that I also use a par{} statement in the main function.
I don't know whether this kind of nested par{} can cause any malfunction, but the compiler accepts it.
The corrected code, with your suggestion applied, is now:

Code: Select all

//Produce test thread
void ProduceThread(streaming chanend c_out){
    int i;
    for (i=0; i<10; i++) {
        c_out <: (i + (i<<16));
        Wait_us(5);
    }
}

//Elabora test function
unsigned int elabora1(unsigned int DataIn, unsigned int h[]){
    unsigned int i, DataOut;
    DataOut = 0;
    for (i=0; i<10; i++) {
        DataOut = DataOut + DataIn*h[i];
    }
    return DataOut;
}

//Elabora test function
unsigned int elabora2(unsigned int DataIn, unsigned int h[]){
    unsigned int i, DataOut;
    DataOut = 0;
    for (i=0; i<10; i++) {
        DataOut = DataOut + DataIn*h[i];
    }
    return DataOut;
}

//Elaborate test thread
void ElaborateThread(streaming chanend c_in, streaming chanend c_out, unsigned int Array[], out port TT){
    unsigned int I,Q, InData, OutI, OutQ;
    unsigned int i;
    unsigned int hQ[100], hI[100];
    for (i=0; i<100; i++) {
        hI[i] = Array[i];
        hQ[i] = Array[i];
    }

    for (i=0; i<10; i++) {
        c_in :> InData;
        //split data into I and Q
        I = InData >> 16;
        Q = InData & 0xFFFF;
        TT <: 1;
#if Q_ENABLE
        par {
            OutI = elabora1(I, hI);
            OutQ = elabora2(Q, hQ);
        }
#else
        OutI = elabora1(Q, hI);
        OutQ = 0;
#endif
        TT <: 0;
        //build 32 bit data
        c_out <: ((OutI & 0xFFFF) << 16)  + (OutQ & 0xFFFF);
    }
}

int main(void) {
    streaming chan c, cOut32;
    par {
      on tile[0]: {
          ProduceThread(c);
      }
      
      on tile[0]: {     
          ElaborateThread(c, cOut32, TestArray, TT2);
      }

       //Other thread with cOut32 channel input
     } //par 
}
Last edited by psebastiani on Mon Jun 29, 2020 6:40 am, edited 1 time in total.

Post by akp »

It seems to me that elabora() is a function, not a thread. I think it would work if you ran a second core that did just the I or the Q processing, perhaps with an input chan and an output chan, while the other processing stays in ElaborateThread.

So when you get an InData, send the Q part over a channel to your Q thread, do the I processing in ElaborateThread, then wait on a channel for the output from the Q thread and recombine the processed Iout and Qout data onto c_out. There are undoubtedly more elegant ways to do it, but I think this would work.

Post by psebastiani »

Yes, elabora1() and elabora2() are functions, but they are called inside a par{} statement, so they should be executed simultaneously in different threads.
I also know that I can split ElaborateThread() into separate threads and join them with channels, but the code above is a simplified extract of a more complex application, and I have no more threads available.
I basically need 2 threads:
1) Acquire data and split it into I and Q, execute elabora1(), recombine I and Q.
2) Execute elabora2().
elabora1() and elabora2() must be executed at the same time because they are time-consuming functions.

Post by akp »

So you're saying you need two threads but you have no more threads available? I'm sorry, but that doesn't seem like it will work.

Post by psebastiani »

That's not correct. I want to execute this code in 2 threads (no more):

Code: Select all

//Elaborate test thread
void ElaborateThread(streaming chanend c_in, streaming chanend c_out, unsigned int Array[], out port TT){
    unsigned int I,Q, InData, OutI, OutQ;
    unsigned int i;
    unsigned int hQ[100], hI[100];
    for (i=0; i<100; i++) {
        hI[i] = Array[i];
        hQ[i] = Array[i];
    }

    for (i=0; i<10; i++) {
        c_in :> InData;
        //split data into I and Q
        I = InData >> 16;
        Q = InData & 0xFFFF;
        TT <: 1;
#if Q_ENABLE
        par {
            OutI = elabora1(I, hI);
            OutQ = elabora2(Q, hQ);
        }
#else
        OutI = elabora1(Q, hI);
        OutQ = 0;
#endif
        TT <: 0;
        //build 32 bit data
        c_out <: ((OutI & 0xFFFF) << 16)  + (OutQ & 0xFFFF);
    }
}
and the functions elabora1() and elabora2() must be executed in parallel.
The compiler now compiles this into 2 threads, but elabora1() and elabora2() are executed sequentially. This is no good for me.

Post by akp »

Can you specify your problem better? Obviously if you got elabora1() and elabora2() to execute in parallel that would consume two threads. So what is the problem with simply having ElaborateThread as two threads? You'll consume two threads either way against your limit of eight threads on the tile.

Post by psebastiani »

I'll try to explain my problem:
I need a block that acquires a 32-bit data stream through a channel and also returns 32-bit data through a channel.
I have 2 threads available to do this.
This block must separate the data into the upper and lower 16 bits, process them separately [elabora1() and elabora2()], then join them again into a 32-bit word. The time-consuming functions are elabora1() and elabora2(), and they should be executed in parallel, otherwise I lose samples.
The execution time of elabora1() + elabora2() in sequence is greater than the sample period of the data stream.
I solved it using 3 or 4 threads, but I need to solve it with 2 threads.
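
The timing constraint described above can be written down as a small check in plain C. This is only a sketch: sample_period_ns and meets_deadline are hypothetical names, and the 5 us period used in the note below is taken from the Wait_us(5) in the test producer, since the real application's sample rate is not stated in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for reasoning about the per-sample time budget. */

/* Sample period in nanoseconds for a given sample rate in Hz.
   rate_hz must be nonzero. */
static inline uint64_t sample_period_ns(uint64_t rate_hz) {
    return 1000000000ull / rate_hz;
}

/* True if the per-sample processing time fits within one sample period. */
static inline bool meets_deadline(uint64_t processing_ns, uint64_t period_ns) {
    return processing_ns <= period_ns;
}
```

A 5 us period corresponds to a 200 kHz rate, so sample_period_ns(200000) gives 5000. Samples are lost whenever the total processing time per sample, including any channel and thread start-up costs, exceeds that period.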

Post by akp »

So ElaborateThread needs to run its main loop in 5 usec? 200 kHz rate?

I think it will be faster if you call elabora1() and elabora2() in sequence rather than running them in a par; try that. I suspect there is too much overhead in starting and stopping the two threads every time you want to do the computation, e.g.

Code: Select all

#if Q_ENABLE
        OutI = elabora1(I, hI);
        OutQ = elabora2(Q, hQ);
#else
        OutI = elabora1(Q, hI);
        OutQ = 0;
#endif
It still might not be fast enough but if it's faster with the above edit then you know the problem is in the overhead of setup and teardown of the tasks, so starting the two tasks once (e.g. one task for I and the other for Q) -- at boot time -- and keeping them up will be faster.
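
The overhead argument above can be illustrated with a deliberately simplified cost model in plain C. It assumes that a par costs a fixed fork/join overhead plus the time of its slower branch, and that each branch runs as fast inside the par as it does alone; neither assumption has been verified on the hardware in this thread, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical cost model; all times are in nanoseconds. */

/* Running both halves back to back in one thread. */
static inline uint32_t seq_time_ns(uint32_t t_i, uint32_t t_q) {
    return t_i + t_q;
}

/* Running both halves in a par: assumed fixed fork/join overhead
   plus the slower of the two branches. */
static inline uint32_t par_time_ns(uint32_t overhead, uint32_t t_i, uint32_t t_q) {
    return overhead + (t_i > t_q ? t_i : t_q);
}

/* Under this model, the par pays off only when its overhead is smaller
   than the shorter of the two branches. */
static inline bool par_is_faster(uint32_t overhead, uint32_t t_i, uint32_t t_q) {
    return par_time_ns(overhead, t_i, t_q) < seq_time_ns(t_i, t_q);
}
```

With illustrative numbers, two 800 ns branches and 100 ns of overhead give 900 ns in a par against 1600 ns in sequence, while 900 ns of overhead gives 1700 ns and the par loses. Starting the tasks once and keeping them running removes that per-sample overhead term from the loop.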