David,
Did you see my comments earlier, where I was talking about, among other things, my ideas for a modern transputer? I mentioned various points about latency and would be interested in your thoughts. I have quoted the relevant part below in case you missed it.
I'm raising it again because you mentioned the 1us latency (I presume this is the time to indicate a communication is needed, not to complete it, so no overlapping of this with other work). It doesn't surprise me: most I/O subsystems seem to be adequately optimised for bulk throughput but really awful on startup latency.
This issue - and the related issue of processor-interconnect latency - is the main reason that we're not doing much parallel computing. Interprocessor communication latency is still around 1 microsecond (the same as the transputer's) for a short message. As a result, many 'supercomputers' are just used as clusters running scripts that launch lots of small jobs.
Best wishes,
Ruth
On 23 Nov 2020 23:35, Ruth Ivimey-Cook wrote:
One thing I have been contemplating for some time is what it would take to make a modern transputer. I feel the critical element of such a design is the provision of channel/link hardware with minimal setup time and DMA-driven access. Reducing latency, especially setup time, requires a coprocessor-like interface to the CPU, so that a single processor instruction can initiate comms. If hardware access were required over PCIe, for example, it would take hundreds of processor cycles. Pushing that work into a coprocessor lets the main CPU get on with other things and maximises the chance that the comms will happen as fast as possible.
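To make the setup-cost point concrete, the software-visible shape I am imagining is a descriptor handed to the link engine in a single coprocessor operation. Everything below is invented purely for illustration, not a real interface:

    #include <stdint.h>

    /* Everything the link engine needs in order to start a DMA-driven
       transfer: in effect, one descriptor per communication. */
    typedef struct {
        uint32_t channel;   /* virtual channel to send on          */
        uint32_t length;    /* bytes to transfer                   */
        uint64_t buffer;    /* address the engine will DMA from    */
    } link_desc_t;

    /* The point of the coprocessor interface: hand the descriptor over in
       (ideally) a single instruction, rather than a series of uncached MMIO
       writes across PCIe.  A plain C body can't express that instruction,
       so this is deliberately a stub. */
    static inline void link_start(const link_desc_t *d)
    {
        (void)d;   /* stands in for 'issue descriptor to link coprocessor' */
    }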
The other side of the coin would be to make the link engine(s) essentially wormhole routers, as in the C104 router chips, complete with packet addressing. The link coprocessor would then become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while it would take up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result is the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
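For illustration, header-addressed wormhole routing in the interval-labelling style of the C104 needs nothing more per hop than a lookup on the header flit, which is exactly why the 'receive and pass on' code disappears. A rough sketch, with invented names and types:

    #include <stdint.h>

    /* The header flit is all a router ever needs to look at: the destination
       label alone selects the output port, so intermediate nodes run no
       routing software at all. */
    typedef struct {
        uint16_t dest;      /* destination label carried by the header flit */
        uint16_t length;    /* payload flits that follow                     */
    } header_flit_t;

    /* Interval-labelling port selection in the C104 style: output port i
       serves all labels in [base[i], base[i+1]). */
    static int select_port(const uint16_t *base, int nports, header_flit_t h)
    {
        for (int i = 0; i < nports; i++)
            if (h.dest >= base[i] && h.dest < base[i + 1])
                return i;
        return -1;          /* label outside the programmed intervals */
    }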
The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as it was on the transputer -- so code could think just about performing the communication, not about which physical layer was involved. There would also need to be a way for the link engine to raise an exception (e.g. a software interrupt) to the processor if it cannot complete the communication 'instantly', potentially requiring a thread to be suspended or released.
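From the point of view of the code doing the communicating, a virtualised send might look something like this -- all the names are hypothetical, just to show the shape:

    #include <stdint.h>
    #include <stdbool.h>

    /* A virtual channel as user code sees it: just a handle.  Whether the
       other end is on-chip, down a physical link, or in another process is
       the link engine's business, not the program's. */
    typedef struct { uint32_t id; } vchan_t;

    /* Hypothetical single-operation entry to the link engine: returns true
       if the transfer was accepted/completed immediately. */
    extern bool link_try_send(uint32_t chan, const void *buf, uint32_t len);

    /* Hypothetical kernel call: park the calling thread until the link
       engine's completion exception for this channel releases it. */
    extern void kernel_wait_channel(uint32_t chan);

    void chan_send(vchan_t c, const void *buf, uint32_t len)
    {
        if (!link_try_send(c.id, buf, len))   /* couldn't complete 'instantly' */
            kernel_wait_channel(c.id);        /* suspend; resumed on completion */
    }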
I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did. I do feel, though, that it is valuable for the hardware to provide appropriate hooks for a lightweight threaded kernel to do the job efficiently.
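And by 'hooks' I mean little more than a completion exception the kernel can field, roughly along these lines -- again purely illustrative, with invented primitives:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct thread thread_t;

    /* One wait slot per virtual channel: the only extra state needed is
       'who was waiting here when the exception fired'. */
    static thread_t *chan_waiter[1024];

    /* Hypothetical scheduler primitives of the lightweight kernel. */
    extern thread_t *sched_current(void);
    extern void      sched_suspend_current(void);    /* switch away, no requeue */
    extern void      sched_make_runnable(thread_t *t);

    /* Called from chan_send() above when a transfer can't finish at once. */
    void kernel_wait_channel(uint32_t chan)
    {
        chan_waiter[chan] = sched_current();
        sched_suspend_current();
    }

    /* Handler wired to the link engine's completion exception/interrupt. */
    void link_completion_irq(uint32_t chan)
    {
        thread_t *t = chan_waiter[chan];
        if (t != NULL) {
            chan_waiter[chan] = NULL;
            sched_make_runnable(t);
        }
    }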