Hi David,
Thanks for your thoughts.
It is in part the off-chip/on-chip disparity that makes me think
that a co-processor style arrangement for comms is what is needed.
As soon as the CPU has to go off-chip all the latencies grow
massively, so keep it local! Again, the need to flip to kernel
mode and initialise DMA and interrupts makes me feel that a
coprocessor that "understands" what is going on, and can therefore
avoid needing to be told everything explicitly each time, is the
right path. Ideally, the 'start communication' instruction would
take as long to execute as a multiply, essentially transferring
values from CPU registers to a queue in the hardware, with another
instruction that executes a software trap if the operation hasn't
completed (so the calling thread can wait on completion).
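A rough software model of that two-instruction interface might look like this (all the names and the queue shape are invented for illustration; in hardware each operation would be a single instruction):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two-instruction interface: 'comms_start'
 * plays the role of the 'start communication' instruction (register
 * values -> hardware queue), and 'comms_poll' plays the role of the
 * instruction that would raise a software trap while the transfer is
 * still in flight. */

#define QUEUE_DEPTH 8

typedef struct {
    uint32_t  channel;  /* virtual channel id                */
    uintptr_t addr;     /* source buffer address             */
    size_t    len;      /* transfer length in bytes          */
    bool      done;     /* set by the (modelled) link engine */
} comms_desc;

static comms_desc queue[QUEUE_DEPTH];
static unsigned   head; /* next free slot */

/* One 'instruction': enqueue a descriptor. No kernel entry, no DMA
 * programming visible to the caller. Returns a handle, or -1 if the
 * hardware queue is full (back-pressure). */
static int comms_start(uint32_t channel, uintptr_t addr, size_t len) {
    if (head == QUEUE_DEPTH)
        return -1;
    queue[head] = (comms_desc){ channel, addr, len, false };
    return (int)head++;
}

/* The companion 'instruction': in hardware this would trap to the
 * scheduler when the descriptor is not yet complete; here it just
 * reports whether the calling thread would have to wait. */
static bool comms_poll(int handle) {
    return queue[handle].done;
}
```

The point is that the common case touches only a small hardware queue, and the trap path is taken only when the caller would actually have to wait.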
The x86_64 cpus from Intel and AMD both have very high speed
inter-core interconnects, though I know little about them. They
are apparently good at bus snooping, and therefore at transferring
short (cache-line = 64 .. 256 bytes) length blocks of data very
quickly. It would be really interesting to know if this hardware
could be targeted by specialised processor instructions to make
it useful for explicit communication, rather than implicit.
You're right about memory, of course, though of late the DDR line has been improving faster than it did for quite a while (for about 10 years performance improved by only about 20%, but in the last 3 or 4 it has almost doubled). However, that doubling is set against a 100- to 1000-fold performance differential to the CPU. Recent generations of memory interface have tried to address this with wider memory buses -- some Intel chips now have 4 parallel channels of 64-bit DDR4 for a 256-bit wide interface -- but although this improves overall throughput it does little for the latency of any individual transfer. The famous "Spectre" (and related) bugs in chip design arise mostly from the efforts to hide this latency from the core.
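To put rough numbers on that differential (illustrative figures, not measurements): a core clocked at a few GHz stalls for hundreds of cycles on each uncached DRAM access, since GHz times nanoseconds gives cycles directly:

```c
/* Back-of-envelope illustration of the processor/memory latency gap:
 * a core at 'core_ghz' GHz waiting on a DRAM access of
 * 'dram_latency_ns' nanoseconds. 1 GHz = 1 cycle per ns, so
 * GHz * ns = cycles spent stalled. */
static double stall_cycles(double core_ghz, double dram_latency_ns) {
    return core_ghz * dram_latency_ns;
}
```

With round numbers like a 3 GHz core and 100 ns to DRAM, that is about 300 cycles per miss -- the gap all that latency-hiding machinery exists to cover.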
The only way I've seen of addressing the issue is to bring memory on-chip... which is what High Bandwidth Memory (HBM) does. It stacks a large DRAM die directly on top of the processor, with parallel interconnections all across the plane. Thus both a wide memory interface and a much higher-speed one (short, small wires) can be achieved. The downside is that the size of memory is limited to what can fit in one package, and it also limits processor power dissipation.
No magic bullets then. But it will take a lot of pressure to push
the software industry to change their methods to enable the
adoption of more parallel-friendly hardware, and I don't see that
happening at present. The current pressures on programming are all
aimed at reducing the cost to market, even at the expense of later
maintenance. Achieved software quality is (IMO) low, and getting
worse.
Best wishes,
Ruth
Hi Ruth,
There are quite a few studies of HPC message latency - for example https://www.hpcwire.com/2019/12/02/it-is-about-latency/.
Ideally, a processor should be able to communicate data as fast as it can process data. And ideally, this should apply to small items of data, not just large ones. We’d like to be able to off-load a procedure call to another processor, for example. The transputer came close to this in 1985. Since then, processor performance has increased by a factor of more than 1,000 (a combination of clock speed and instruction-level parallelism), but communication performance hasn’t increased much at all (except for very large data items). The issue seems to be primarily the set-up and completion times - switch to kernel, initialise DMA controller, de-schedule process … interrupt, re-schedule process. However, there is also an issue with chip-to-chip communication. If you build chips with lots of processors - potentially hundreds with 5nm process technology - the imbalance between on-chip processing performance and inter-chip communication performance is vast (both latency and throughput).
I made some proposals about these issues in http://www.icpp-conf.org/2017/files/keynote-david-may.pdf (especially from slide 27).
Incidentally, the problem with most current architectures isn’t just communications - the memory systems don’t work well either (see from slide 17 of above)!
It is entirely possible to fix the problems with communication set-up and completion - one option is the microprogrammed transputer-style implementation; another is the multithreaded XCore style with single-cycle i/o instructions (which means that threads can act as programmable DMA controllers); yet another is to provide execution modes (with dedicated registers) for complex instructions and interrupts - this enables the process management to be done by software and was first used on the Atlas computers of the 1960s (Normal control, Extracode control, Interrupt control)!
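The XCore-style option can be sketched as an ordinary copy loop: if input and output are single-cycle instructions, a plain thread is as good as a DMA channel ('port_in' below is an invented stand-in, not a real XCore intrinsic):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of 'threads as programmable DMA controllers': with
 * single-cycle i/o instructions, a software loop is fast enough to
 * move data between a port and memory. 'port_in' stands in for a
 * single-cycle input instruction. */
static uint32_t port_in(volatile const uint32_t *port) {
    return *port; /* on real hardware: one IN instruction */
}

/* A thread running this loop behaves like a DMA channel, but its
 * 'descriptor format' is just ordinary code, so the same thread can
 * be reprogrammed to scatter, gather, or checksum as it copies. */
static void dma_thread(volatile const uint32_t *port,
                       uint32_t *dst, size_t words) {
    for (size_t i = 0; i < words; i++)
        dst[i] = port_in(port);
}
```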
All the best
David
On 5 Dec 2020, at 03:04, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:
David,
Did you see my comments earlier, where I was talking about, among other things, my ideas for a modern transputer? I mentioned various points about latency, and would be interested in your thoughts. I have quoted the relevant part below in case not.
I'm raising it again because you mentioned the 1us latency (I presume this is the time to indicate a communication is needed, not to complete it, so no overlapping of this with other work). It doesn't surprise me: most I/O subsystems seem to be adequately optimised for bulk throughput but really awful on startup latency.
On 04/12/2020 19:24, David May wrote:
This issue - and the related issue of processor-interconnect latency - is the main reason that we’re not doing much parallel computing. The interprocessor communication latency is still around 1 microsecond (the same as the transputer) for a short message. As a result, many ‘supercomputers’ are just used as clusters running scripts that launch lots of small jobs.
Best wishes,
Ruth
On 23 Nov 2020 23:35, Ruth Ivimey-Cook wrote:
One thing I have been contemplating for some time is what it would take to make a modern transputer. I feel the critical element of a design is provision of channel/link hardware that has minimal setup time and DMA driven access. I feel reducing latency, especially setup time, requires a coprocessor-like interface to the cpu, so that a single processor instruction can initiate comms. If hardware access were required over PCIe, for example, it would take hundreds of processor cycles. Pushing that work into a coprocessor enables the main cpu to get on with other things, and maximises the chance that the comms will happen as fast as possible.
The other side of the coin would be to make the link engine(s) essentially wormhole routers, as in the C104 router chips, complete with packet addressing. Thus the link coprocessor would become some number of interfaces directly to the CPU, plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while it would take up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result is the elimination of the mass of 'receive and pass on' routing code that used to be needed with a directly connected link design.
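As a sketch of what the packet addressing might involve (field sizes and names invented): the C104 used interval labelling to pick an output link from the destination in the packet header, which a few lines of C can model:

```c
#include <stdint.h>

/* Rough model of a wormhole packet header: the router inspects only
 * this to set up a crossbar path, and the payload flits 'worm'
 * through behind it. Field sizes are invented. */
typedef struct {
    uint16_t dest; /* destination terminal address */
} wormhole_header;

/* Interval-labelled routing, C104 style: output link j is chosen
 * when base[j] <= dest < base[j+1]. 'base' must have nlinks+1
 * entries, partitioning the address space. */
static int route(const uint16_t *base, int nlinks, uint16_t dest) {
    for (int j = 0; j < nlinks; j++)
        if (base[j] <= dest && dest < base[j + 1])
            return j;
    return -1; /* no interval matched: misrouted packet */
}
```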
The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing communication, not about which physical layer was involved, and also a way for the link engine to raise an exception (e.g. a software interrupt) to the processor if it cannot complete the communication 'instantly', potentially requiring a thread to suspend or be released.
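A minimal sketch of that virtual-link interface, under an entirely invented API (the 'software interrupt' is modelled by recording which channel blocked, so a kernel could deschedule the thread):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the virtual-link idea: user code calls 'link_send' on a
 * virtual channel and never sees the physical layer. If the link
 * engine cannot complete 'instantly', it raises an exception to the
 * processor -- modelled here by noting the blocked channel -- so the
 * kernel can suspend the calling thread. All names are invented. */

static bool engine_ready = true; /* can the engine take this now?   */
static int  last_blocked = -1;   /* channel of last 'soft interrupt' */

/* Returns true if the send completed immediately; otherwise signals
 * the processor and returns false, so the caller's thread can be
 * descheduled until the link engine releases it. */
static bool link_send(int vchan, const void *buf, size_t len) {
    (void)buf; (void)len; /* payload ignored in this model */
    if (engine_ready)
        return true;
    last_blocked = vchan; /* models raising the software interrupt */
    return false;
}
```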
I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the transputer did, but I do feel it is valuable for the hardware to provide appropriate hooks for a lightweight threading kernel to do the job efficiently.