Wishlist for architecture

Post by **segher** » Sat Nov 06, 2010 9:38 am

This came up in the "we all hate XC because..." thread, but it's of course buried
there, so let's start a new topic.

A few things I noticed I would like in the architecture:

- Immediate arg instructions for pretty much everything, but especially lss, lsu,
  and the various load (and maybe store) insns, like ld8u. There is of course no
  space in the opcode map for 2-byte insns for these, but there is a lot of space
  for 4-byte insns. The advantages of having such insns:
  - The immediate is not always 0 (it can be some other small number, say 4), so
    currently it always takes an extra register load insn; you cannot reuse the
    reg you load the imm into for a few ops (or even blow a whole reg globally to
    hold the value, like you can do for 0);
  - It will take the same space as a pair of load immediate, op-with-reg insns,
    but it is (or can be) faster;
  - It makes life slightly easier for compilers.

- Performance monitor counters. Some obvious ones: count cycles, count instructions,
  count (cycles spent on) filler nops, count cycles waiting for resources (per resource?).
  Perhaps trap to debug mode when a counter reaches some (power of two?) value.
  Some of these should be per thread, but not the resource ones, I suppose.
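One reason a power-of-two trap value is attractive: the hardware would only have to watch a single counter bit. The C sketch below models that idea. It is an illustration under assumptions, not anything XMOS has specified: the `perf_counter` type, the `trap_bit` field and the choice to fire on every multiple of 2^trap_bit (rather than once) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one performance counter. */
typedef struct {
    uint32_t count;     /* events seen so far */
    unsigned trap_bit;  /* trap whenever count reaches a multiple of 2^trap_bit */
} perf_counter;

/* Count one event. Returns true when the (assumed) debug trap would fire. */
bool perf_counter_tick(perf_counter *c)
{
    assert(c != NULL && c->trap_bit < 32u);

    uint32_t before = c->count;
    c->count = before + 1u;

    /* An increment lands on a multiple of 2^trap_bit exactly when it
     * carries out of the low trap_bit bits, i.e. when bit trap_bit flips.
     * So comparing one bit before and after is enough. */
    return (((before ^ c->count) >> c->trap_bit) & 1u) != 0u;
}
```

With `trap_bit = 3`, sixteen ticks from zero would fire the trap twice, at counts 8 and 16.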

Christmas is coming up, anyone else have a wishlist? :-)

Post by **lilltroll** » Sat Nov 06, 2010 2:28 pm

Maximum open source...

Download the XMOS emulation FPGA code and add your own instructions.
The best implementation will get the instruction set added in the next XC(n) implementation :mrgreen:

An alternative question would be: how do you best spend n transistors?
(Let's assume a 45 nm process based on 193 nm UV, but as I understand it, it gets rapidly more expensive going below 90/65 nm, since you have to use so many tricks to be able to use the 193 nm UV.)

Doubling the amount of SRAM must cost at least 6*65536 + overhead > 400 000 transistors per core.
I guess you could also spend them on a larger set of instructions instead.
The alternative is to use an SDRAM with zero wait states @ 100+ MHz, 32-bit data, 16-bit address for each "page".
If it also had a counter which could update the address for sequential reads, you could fetch a new int each cycle??
Adding a memory controller to the switch would be?? Very stupid idea??

To be able to connect counters to the macs instruction etc. would be great.
The MAC already has 6 operands, but if you could instead connect a counter during init, as with the port logic, you could save the stupid +1 instruction.
Also, using circular counters would save a lot of SRAM when you have to choose between speed and memory in FIR filters.
On the other hand, it is easy to distribute a FIR filter on one core. I played with >55 M taps/s on one G4 core, for one distributed FIR filter.
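To make the speed-versus-memory trade-off concrete, here is a minimal C sketch of a FIR step that uses a circular index in software. The names `FIR_TAPS`, `fir_state` and `fir_step` are made up for illustration. The two wrap-around lines are the bookkeeping a hardware circular counter could presumably take over. The usual software way to avoid them is to keep a doubled delay line, which costs twice the SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 4u  /* hypothetical filter length, kept small for the sketch */

typedef struct {
    int32_t delay[FIR_TAPS]; /* exactly one slot per tap, no doubled buffer */
    size_t  head;            /* index of the most recent sample */
} fir_state;

/* Push one sample into the delay line and return the filter output. */
int64_t fir_step(fir_state *s, const int32_t coeff[FIR_TAPS], int32_t sample)
{
    assert(s != NULL && coeff != NULL);

    s->head = (s->head + 1u) % FIR_TAPS;  /* circular increment */
    s->delay[s->head] = sample;

    int64_t acc = 0;
    size_t idx = s->head;
    for (size_t k = 0; k < FIR_TAPS; ++k) {
        acc += (int64_t)coeff[k] * s->delay[idx];      /* multiply-accumulate */
        idx = (idx == 0u) ? FIR_TAPS - 1u : idx - 1u;  /* circular decrement */
    }
    return acc;
}
```

Feeding a single 1 followed by zeros returns the coefficients one after another, which is a quick way to check that the indexing wraps correctly.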

In the end it's about an optimization for the actual users.

Post by **segher** » Sat Nov 06, 2010 5:46 pm

Number of transistors doesn't matter much anymore, as long as you don't switch
them all the time. You have a minimum die size simply because of the number of
contacts you need for the I/O pins, and that will fit a lot of transistors at 65nm :-)

Doubling RAM costs a lot of power, and a significant number of transistors; I would
think they don't use 6T for the ram, but 8T (or some 9T or 10T). Even with 6T it
would be about half a million transistors per core; adding some extra instructions
on the other hand costs only a handful of gates!
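As a rough sanity check on the SRAM figures in this exchange, here is a minimal C sketch that counts storage-cell transistors only. It assumes 64 kB (65536 bytes) of extra RAM per core and one cell per bit. The helper name `sram_cell_transistors` is made up for illustration, and decoders, sense amplifiers and other overhead are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Storage-cell transistors for `bytes` of SRAM built from cells with
 * `transistors_per_cell` transistors per bit (6 for 6T, 8 for 8T, ...).
 * Periphery such as decoders and sense amplifiers is not counted. */
uint64_t sram_cell_transistors(uint64_t bytes, unsigned transistors_per_cell)
{
    assert(transistors_per_cell > 0u);

    const uint64_t bits = bytes * 8u;  /* one cell per bit */
    return bits * transistors_per_cell;
}
```

Under those assumptions, 64 kB is 524 288 bits, so the cells alone come to about 3.1 million transistors at 6T and about 4.2 million at 8T. That is more than the estimates above, which only strengthens the point that extra RAM is far more expensive than a few extra instructions.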

I agree about adding some kind of memory controller, but that's a whole separate
topic :-)

Post by **segher** » Sun Nov 21, 2010 1:52 am

Here's three more:

GETR that takes a register for the resource type. Not useful terribly
often, and when you need it you can do a switch() thing, but still :-)

An insn to get LR into a normal register. Useful for inlined constants.

A long add that adds two double-register numbers together. There
is hardware to do that already as far as I can see, for the MAC stuff.

Post by **ale500** » Thu Jan 20, 2011 11:22 am

I was just wondering about some of those points and what I think would be most appreciated is a lot more SRAM. At least 4 times more if not more. The instruction set is ok. You can get a double add using 4 short adds but more registers would be sweeter, sadly they do not fit in the assembler. The way they compress it to fit so much in 16 bits is really the key to high-speed :). I wonder what the next silicon will bring. I'm sure they already have something even more amazing in the works :)

Post by **segher** » Fri Jan 21, 2011 3:50 am

ale500 wrote: I was just wondering about some of those points and what I think would be most appreciated is a lot more SRAM. At least 4 times more if not more.

But as said before, that will likely not fit on the chip, in either space or power.

64kB is a lot anyway, if you aren't totally dumb in how you use it. Sometimes you really need more,
but then you need a couple of MB, which certainly won't fit on the chip.

Anyway, this thread is about the architecture, memory size is an implementation detail :-)

ale500 wrote: The instruction set is ok.

It's very nice, indeed -- but it could be just a tiny bit nicer still :-)

ale500 wrote: You can get a double add using 4 short adds

You can do it in two instructions actually. But two is more than one.
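The two-instruction version presumably chains two add-with-carry steps. As far as I know, the XS1 long add (`ladd`) takes a carry in and produces a carry out, which is what makes this possible. The C sketch below only models that arithmetic. The names `add_with_carry`, `u64_pair` and `add_pair` are made up for illustration and are not XMOS APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one add-with-carry step: returns the low 32 bits of
 * x + y + carry_in and stores the carry out (0 or 1) in *carry_out. */
static uint32_t add_with_carry(uint32_t x, uint32_t y,
                               uint32_t carry_in, uint32_t *carry_out)
{
    assert(carry_out != NULL);

    uint64_t wide = (uint64_t)x + y + (carry_in & 1u);
    *carry_out = (uint32_t)(wide >> 32);
    return (uint32_t)wide;
}

/* A double-register (64-bit) value held as two 32-bit words. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Add two double-register numbers in two add-with-carry steps. */
u64_pair add_pair(u64_pair a, u64_pair b)
{
    u64_pair r;
    uint32_t carry;

    r.lo = add_with_carry(a.lo, b.lo, 0u, &carry);     /* step 1: low words */
    r.hi = add_with_carry(a.hi, b.hi, carry, &carry);  /* step 2: high words plus carry */
    return r;
}
```

A single long add on register pairs, as wished for earlier in the thread, would fold both steps into one instruction.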

ale500 wrote: but more registers would be sweeter,

I think 12 general purpose registers is actually quite a sweet spot. With 16, you couldn't have
as many three-register ops, so you'd have to use an encoding with dest=src often, which is
awful to work with (and requires more insns in many cases); and 8 is really not enough either.
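The encoding argument can be checked with a little arithmetic. The C sketch below assumes a 16-bit instruction word and that the three register operands are packed jointly rather than in separate fixed-width fields. The function names are made up for illustration.

```c
#include <assert.h>

/* Smallest number of bits that can distinguish `count` values (count >= 1). */
unsigned bits_needed(unsigned long count)
{
    assert(count >= 1ul);

    unsigned bits = 0;
    while ((1ul << bits) < count) {
        ++bits;
    }
    return bits;
}

/* Bits needed to encode three independent register operands jointly. */
unsigned three_operand_bits(unsigned registers)
{
    unsigned long combos = (unsigned long)registers * registers * registers;
    return bits_needed(combos);
}
```

With 12 registers there are 1728 operand combinations, which fit in 11 bits and leave 5 bits of a 16-bit word for the opcode. With 16 registers the operands need 12 bits, leaving only 4 opcode bits, so half as many three-register ops. With 8 registers they need 9 bits.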

Post by **ale500** » Fri Jan 21, 2011 11:15 am

segher wrote: Anyway, this thread is about the architecture, memory size is an implementation detail :-)

I disagree here. If you look closely at the instructions for memory access, 64 kB is just a natural fit. So it is, imho, part of the architecture :)

Post by **lilltroll** » Sat Jan 22, 2011 3:22 pm

I believe that David responded a long time ago that instructions can be fitted to address larger memory areas, if that would be the way to go.
64 kB of L1 cache is a lot on most architectures. Some vendors of low-power chips (that don't need heavy cooling) use Package-on-Package for SDRAM. That way you save PCB paths.

http://upload.wikimedia.org/wikipedia/e ... ematic.JPG

I agree a little that either you need a lot of RAM or you can do it with 64 kB.
Just as an example to play with: if you could buy an L4 with 4x64 kB SRAM or an L2 with 2x256 kB, for the same price, which one would sell best?
Or if we had the choice of an L4 with 4x64 kB SRAM and 256 Mbit of RAM on top, using the on-die high-speed links directly from the switch, so all cores could address the space with high bandwidth, and no output pins were stolen for the memory connection.

Post by **Folknology** » Sat Jan 22, 2011 4:47 pm

lilltroll wrote: I agree a little that either you need a lot of RAM or you can do it with 64 kB.
Just as an example to play with: if you could buy an L4 with 4x64 kB SRAM or an L2 with 2x256 kB, for the same price, which one would sell best?
Or if we had the choice of an L4 with 4x64 kB SRAM and 256 Mbit of RAM on top, using the on-die high-speed links directly from the switch, so all cores could address the space with high bandwidth, and no output pins were stolen for the memory connection.

I would love an L2 with 2*64k and an external (mapped) memory interface in a 144-pin package with regular spacing (0.5mm)

Post by **ale500** » Sat Jan 22, 2011 6:14 pm

You are not the only one!
