But the Arm Cortex-A65 has an "out-of-order execution pipeline and can execute two-threads in parallel on each cycle". Does that mean that even if it sometimes has to reorder instructions for speed etc., it can still execute two threads in parallel, i.e. two logical CPUs per core? Is this like an xCORE ("on each cycle", even though the xCORE multiplexes the cycles out in a smarter/simpler way), apart from the pipeline (the xCORE has no pipeline... right?) and with two logical cores instead of 8?
Would the xCORE architecture be able to scale up to large machines like that, or do such machines have to end up as "monstrous"(?) 200W++ designs? Or are these designs really as elegant as they would have me believe?
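A toy sketch of the two issue schemes being compared here, under loud assumptions: it takes the description above at face value (an xCORE-style core hands one issue slot per cycle to its logical cores in turn, while a two-way multithreaded core lets both threads issue on every cycle). It ignores stalls, dependencies and issue width, and is not a model of either real microarchitecture. The function names are made up for illustration.

```python
def round_robin_slots(n_threads, n_cycles):
    """Assumed xCORE-style scheme: one issue slot per cycle, handed out in turn."""
    slots = [0] * n_threads
    for cycle in range(n_cycles):
        slots[cycle % n_threads] += 1
    return slots


def smt_slots(n_threads, n_cycles):
    """Idealised SMT: every thread gets an issue slot on every cycle (no stalls)."""
    return [n_cycles] * n_threads


# Eight logical cores sharing one slot per cycle, over 80 cycles.
print(round_robin_slots(8, 80))
# Two threads, both issuing on each of the 80 cycles.
print(smt_slots(2, 80))
```

In this idealised picture each of the eight round-robin threads advances at one eighth of the clock rate, while each of the two SMT threads advances at the full rate, which is the difference the question is really about.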
Ampere Altra: The World’s First Cloud Native Processor
https://amperecomputing.com/altra/
Ampere™ Altra™ offers up to 80 cores at up to 3.0 GHz speed with sustained turbo performance. Each core is single threaded by design with its own 64 KB L1 I-cache, 64 KB L1 D-cache, and a huge 1 MB L2 D-cache, delivering predictable performance 100% of the time by fully eliminating the noisy neighbor challenge within each core.
Ampere Altra Addresses ARM Aspirations
Cloud Server Chip Has 80 CPU Cores and Big Shoes to Fill
by Jim Turley
https://www.eejournal.com/article/amper ... irations/
What they don’t have is multithreading.
The lack of multithreading seems odd in such a high-end processor, especially since the obvious competitors from Intel and AMD have both offered that feature for years as a matter of course. It’s expected, and ARM fully supports it. AMD’s 64-core Epyc 7742 supports 128 threads, versus 80 threads for Altra. So why no multithreading in Altra?
Ampere seems almost defensive in justifying its single-threaded design decision, like a mother protecting her newborn, but the company may have a valid point. Server systems really do differ from their desktop ancestors in at least one case: they run workloads for several unrelated users. Most x86 systems, even big ones, tend to run large applications for a single user, and they benefit from chopping up that workload into multiple threads. The job is done when all threads complete. But cloud servers, by their very nature, tend to run smaller, unrelated tasks from isolated users. There’s less commonality among tasks (none at all, really) and therefore less reason to share computing resources, caches, and program state. It might even be counterproductive for two (or more) threads to share a processor core and its caches; they’d interfere with each other.
Whatever the reasoning, Altra executes just one thread per processor, giving it less performance potential than its x86 opponents. Whether that translates into less actual performance on the relevant server workloads remains to be seen. The benchmarks suggest it’s not a problem.
Cortex-A65: A Multithreaded DynamIQ CPU
https://www.arm.com/products/silicon-ip ... ortex-a65
Arm Cortex-A65 is a multithreaded Cortex-A DynamIQ CPU, delivering highest levels of throughput efficiency. It can process two threads simultaneously and scales up to eight cores in a single cluster. The thermally efficient Cortex-A65 is designed for non-safety applications such as navigation, sensor fusion and vision-based systems.
The multithreaded processor has an out-of-order execution pipeline and can execute two-threads in parallel on each cycle. Each thread can be at different exception levels and running different operating systems.
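For reference, the thread counts in the excerpts above are just cores multiplied by threads per core. A minimal sketch of that arithmetic follows. The two threads per core for the Epyc 7742 is inferred from the quoted 64 cores and 128 threads, and the Cortex-A65 figure is for one full eight-core cluster. The helper name is made up for illustration.

```python
def hardware_threads(cores, threads_per_core):
    """Total hardware threads = cores x threads per core."""
    return cores * threads_per_core


# Figures taken from the quoted excerpts; threads per core for Epyc is inferred.
epyc_7742 = hardware_threads(cores=64, threads_per_core=2)
altra = hardware_threads(cores=80, threads_per_core=1)
a65_cluster = hardware_threads(cores=8, threads_per_core=2)

print("Epyc 7742:", epyc_7742)
print("Ampere Altra:", altra)
print("Cortex-A65 cluster:", a65_cluster)
```

This only counts hardware threads. It says nothing about per-thread performance, which is the point the Ampere excerpt makes about threads sharing a core and its caches.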