Fastest possible bit mapping/manipulation/compression

mculibrk · Post by **mculibrk** » Fri Apr 08, 2011 7:05 pm

Time to crunch some numbers...

I need to "map" 32 bytes to 32 bits as fast as possible following this logic:

input:
- there is an array of 32 bytes (or 16 short ints or 8 unsigned ints)
- there is a Threshold value

output:
- a single unsigned int (32 bits) where each bit represent if the array[element] > Threshold

sort of:

Code: Select all

unsigned char ARR[32];
int index;
unsigned char THR;
unsigned int VAL;

VAL = 0;
for (index=0; index < 32; index++) {
    if (ARR[index] > THR)
        VAL = VAL | (1 << index);
}

example:

THR = 100
ARR = { 10, 20, 121, 200, ...., 150,  101, 0, 200, 1  }
VAL = 0b01011.......................1100

this should be calculated in under 1 microsecond... :o

mission impossible???

Any ideas?

best regards,
mculibrk

mculibrk · Post by **mculibrk** » Fri Apr 08, 2011 11:17 pm

I think I found a possible solution but I really need some help for constructing a functioning inline asm code as I'm not very good with inline asm...

the solution would be something similar:

Code: Select all


unsigned Base;
unsigned Thr;

LD        regB, Base                     ; load value of Base to regB
MKMSKI    regM, 24                          ; regM = 0x00FFFFFF
LD        regT, Thr                         ; put value of Thr in regT
SHLI      regT, regT, 24                ; Thr <<= 24
OR        regT, regT, regM           ; Thr |= 0x00FFFFFF
OR        regM, regT, regT            ; save regT for later in regM

--- a
LDWI      regV, regB, 0                  ;  regV = array[base + 0]
LSU       regR, regV, regT            ; regV = (regV < regT)
OR        regX, regR, 0                  ; X |= (regV < regT)
SHLI      regX, regX, 1                  ; X <<= 1
SHRI     regT, regT, 8                  ; regT >>= 8   (Thr = 0x00"Thr"FFFF)
ZEXTI     regV, regV, 24                ; regV &= 0x00FFFFFF
OR         regT, regM, regM           ; restore regT
--- b

repeat  block from --- a to --- b  7 more times

result of compressed 32 byte --> 32 bit  should be in regX

Some help would be greatly appreciated.

Best regards,
mculibrk

Post by **jonathan** » Sat Apr 09, 2011 12:15 am

Hi mculibrk,

Firstly, quick request. Please only post your question once. If people can't help you then they won't, it doesn't matter how many times you post or link to your own post. If people can and want to help you, they will, but they are more likely to if you only post it once!

Secondly, this is not an official XMOS support channel (as far as I know) - this is a forum of interested users, some of whom happen to be XMOS staff. What kind of project are you working on? Is it a commercial project that needs resolving ASAP? Have you sent a ticket into XMOS? If so, and you have not received a response, what is your deadline?

I don't speak for the community, but I am sure if you bear with us, people here will try to answer your questions if they possibly can.

Best wishes,

Jonathan

mculibrk · Post by **mculibrk** » Sat Apr 09, 2011 1:03 am

I'm really sorry for the mess I did....

because its asm and c/xc related and "I'm in a hurry" I decided to post the question in both threads because I do not know where the questions fits best. I'll avoid any similar "cross/double postings" in the future. I'm really sorry for the inconvenience.

I know he people here are really great and responsive - as I already experienced.

Well, this "deadline" is not a real one... I'm working on a project (sw pwm led driver) I should "demonstrate" shortly. It's not any commercial stuff anyway. At least not yet. :)
I did not send any tickets to anyone as this is just a "programming/knowledge problem"...

Again sorry for any inconvenience.

Best regards,

mculibrk

segher · Post by **segher** » Sat Apr 09, 2011 6:12 am

You want to compute a mask of 32 comparisons, 1M times per second. Assuming you
run the core at 400MHz (the default), and you are using only one thread for this, that
means you have room for only three instructions per bit you construct. That is without
considering how you get all that data in; perhaps via another thread.

Is the threshold a constant? If not, is it a constant for many consecutive runs? What
is the actual maximum value for the threshold (can it be >127)?

richard · Post by **richard** » Sat Apr 09, 2011 2:09 pm

mculibrk wrote:I think I found a possible solution but I really need some help for constructing a functioning inline asm code as I'm not very good with inline asm...

the solution would be something similar:

Code: Select all


unsigned Base;
unsigned Thr;

LD        regB, Base                     ; load value of Base to regB
MKMSKI    regM, 24                          ; regM = 0x00FFFFFF
LD        regT, Thr                         ; put value of Thr in regT
SHLI      regT, regT, 24                ; Thr <<= 24
OR        regT, regT, regM           ; Thr |= 0x00FFFFFF
OR        regM, regT, regT            ; save regT for later in regM

--- a
LDWI      regV, regB, 0                  ;  regV = array[base + 0]
LSU       regR, regV, regT            ; regV = (regV < regT)
OR        regX, regR, 0                  ; X |= (regV < regT)
SHLI      regX, regX, 1                  ; X <<= 1
SHRI     regT, regT, 8                  ; regT >>= 8   (Thr = 0x00"Thr"FFFF)
ZEXTI     regV, regV, 24                ; regV &= 0x00FFFFFF
OR         regT, regM, regM           ; restore regT
--- b

repeat  block from --- a to --- b  7 more times

result of compressed 32 byte --> 32 bit  should be in regX

Some help would be greatly appreciated.

Best regards,
mculibrk

Are you sure this is equivalent to your original code? If you repeat block from a to b 8 times then only end up 8 bits of regX will be set, but the original code returns a 32bit result.

richard · Post by **richard** » Sat Apr 09, 2011 2:23 pm

This is the best I can come up with:

Code: Select all

#include <print.h>
#include <xclib.h>

// Returns a + (b << 1)
static inline unsigned lda16f(unsigned a, unsigned b)
{
  unsigned val;
  asm("lda16 %0, %1[%2]" : "=r"(val) : "r"(a), "r"(b));
  return val;
}

unsigned f(unsigned threshold, unsigned a[8]) {
  unsigned val;
  unsigned cmp = (threshold << 24) | 0x00ffffff;
  unsigned tmp;
  
  tmp = a[7];
  val = tmp > cmp;
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[6];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[5];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[4];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[3];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[2];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[1];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  tmp = a[0];
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);
  tmp <<= 8;
  val = lda16f(tmp > cmp, val);

  return bitrev(val);
}

int main()
{
  unsigned char a[32] = {
    10, 20, 121, 200, 82, 122, 7, 230,
    164, 150, 9, 190, 187, 255, 0, 66,
    101, 0, 200, 1, 89, 10, 20, 30,
    99, 254, 3, 100, 206, 32, 64, 99
  };
  unsigned threshold = 100;
  unsigned result = f(threshold, (a, unsigned[]));
  printhexln(result);
  return 0;
}

When built with -O2 -fschedule the code takes 400 processor clock cycles assuming no more than 4 threads are running. This is 1000ns if the processor is running at 400Mhz and 800ns at 500MHz.

segher · Post by **segher** » Sat Apr 09, 2011 9:08 pm

Assuming the values are bytes, and the values are <128, and thresholds are <127,
you can do something like this:

Code: Select all

u32 get4bits(u8 *a, u8 thres)
{
  u32 x;
  // get four values
  asm("ldw %0,0[%1]" : "=r"(x) : "r"(a));

  // put guard bits in
  x |= 0x80808080;

  u32 thres4 = thres+1;
  thres |= thres << 8;
  thres <<= thres << 16;

  // subtract; the guard bits will stay 1 iff the value was >thres
  x -= thres4;

  // now move the bits into place
  x &= 0x80808080;
  x |= x >> 7;
  x |= x >> 14;

  return x & 15;
}

which is (properly unrolled, etc.) 7 cycles for 4 values. You can do slightly better by
doing more than 4 bytes at once, and/or using 64-bit adds.

Oh, this is C code, making it valid XC is left as an exercise to the reader :-P

mculibrk · Post by **mculibrk** » Sat Apr 09, 2011 10:16 pm

First of all I would like to say A BIG THANKS to both you segher and richard for all your help, ideas and solutions!
YOU ARE GREAT! THANKS!!!

segher wrote:You want to compute a mask of 32 comparisons, 1M times per second. Assuming you
run the core at 400MHz (the default), and you are using only one thread for this, that
means you have room for only three instructions per bit you construct. That is without

Yeah... that's pretty much exactly as you stated. Hmm.... I obviously did a really bad math prior to starting this project... :(

considering how you get all that data in; perhaps via another thread.

Is the threshold a constant? If not, is it a constant for many consecutive runs? What
is the actual maximum value for the threshold (can it be >127)?

Well, the "function" of the code is actually a multi-channel software PWM, so yes, there are many consecutive runs with changing threshold (each run has a constant threshold).
The original "wish" was to have 8-bit pwm (256 levels) but depending on the possibilities/capabilities and settings that could easily be limited to 7 or less bits.

I'm really impressed with the ideas/solutions you propose. I tried various procedures to calculate that mask and analysed the outcome with XTA with best result of ~2000 nS without using any asm help.
After that I "switched" to asm trying to find a better solution.

Well, it was rather late when I wrote that asm down... after quite some time thinking/working on it after a few days of almost no sleep.... so I'm sure I wrote/made a lot of mistakes in the snipets I showed...

Again, a BIG THANKS for your help, guys!

best regards,
m.culibrk

mculibrk · Post by **mculibrk** » Sat Apr 09, 2011 10:50 pm

Mmmm.... very interesting solution... and very, very fast!
I was thinking of some "match trick" solution like yours but "blindly" went for 8-bit values/threshold so I did not find any elegant way of doing it (without using some guards like you did).

segher wrote:Assuming the values are bytes, and the values are <128, and thresholds are <127,
you can do something like this:
Code: Select all
u32 get4bits(u8 *a, u8 thres)
{
  u32 x;
  // get four values
  asm("ldw %0,0[%1]" : "=r"(x) : "r"(a));

  // put guard bits in
  x |= 0x80808080;

  u32 thres4 = thres+1;
  thres |= thres << 8;
  thres <<= thres << 16;

  // subtract; the guard bits will stay 1 iff the value was >thres
  x -= thres4;

  // now move the bits into place
  x &= 0x80808080;
  x |= x >> 7;
  x |= x >> 14;

  return x & 15;
}
which is (properly unrolled, etc.) 7 cycles for 4 values. You can do slightly better by
doing more than 4 bytes at once, and/or using 64-bit adds.

Oh, this is C code, making it valid XC is left as an exercise to the reader :-P

but... how is it just 7 cycles? if you think just calculating the mask then yes... but I need to "pack" the bits at the end and that is a penalty (again)...
But, after a thought, maybe I do not need to "repack" the bits at the end. I could just do

Code: Select all

instead of:

  // now move the bits into place
  x &= 0x80808080;
  x |= x >> 7;
  x |= x >> 14;

just:
  res = x;
  res >>= 1

and use the res after "unrolling" the function 7 more times.

As I mentioned, this is a 4-channel pwm output - using a 4-bit output port - so a nibble in the result represents the all 4 channels state. I can have the source (the input array) ordered in any way so instead of having:

byte array [] = { ch0v0, ch1v0, ch2v0, ch3v0, ch0v1, ch1v1.... }

I could do

byte array [] = { ch0v0, ch0v1, ch0v2, ch0v3...ch0v7, ch1v0, ch1v1.... }

so at the and I still get nibbles filled with different channel values.

Again, THANKS to both of you!

regards,
mculibrk

Fastest possible bit mapping/manipulation/compression

Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression

Re: Fastest possible bit mapping/manipulation/compression