Ruth and David,
I am trying to follow your discussion here, but let me cut to the chase: the implication is that communication setup time has stalled at 1 us and has not improved since the Transputer. Measured in cycles, that means communication setup time has INCREASED from 20 cycles to about 2000 cycles. This problem is not new: I remember that one of the occam porting projects, to (I believe) the PowerPC (I can't find the reference), actually tackled external communication, but the overhead was about 700 cycles if I remember right.
Fixing the problems (as David says) that slow you down by a factor of 100 certainly seems to make sense. I wonder what the physical limits are if you try to make a 2 GHz Transputer? The speed of light over 20 cycles would still give you 3 meters. I know this is crude extrapolation, but . . . I wonder if the real problem is that people have given up on massively parallel programming (except for GPUs).
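For what it's worth, here is the arithmetic behind that 3-meter figure as a quick check, assuming a 2 GHz clock and the transputer-era ~20-cycle setup cost:

```python
# Back-of-envelope check of the speed-of-light figure above.
c = 3.0e8            # speed of light, m/s (rounded)
f = 2.0e9            # hypothetical 2 GHz clock
cycles = 20          # transputer-era setup cost, in cycles
t = cycles / f       # 20 cycles at 2 GHz = 10 ns
d = c * t            # distance light covers in that time
print(d)             # ~3.0 meters
```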
Larry
Hi David, Thanks for your thoughts. It is in part the off-chip/on-chip disparity that makes me think
that a co-processor style arrangement for comms is what is needed.
As soon as the CPU has to go off-chip all the latencies grow
massively, so keep it local! Again, the need to flip to kernel
mode and initialise DMA and interrupts makes me feel that a
coprocessor that "understands" what is going on, and can therefore
avoid needing to be told everything explicitly each time, is the
right path. Ideally, the 'start communication' instruction would
take as long to execute as a multiply, essentially transferring
values from CPU registers to a queue on the hardware, with another
instruction that executes a software trap if the operation hasn't
completed (so the calling thread can wait on completion).
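A minimal sketch of how that two-instruction interface might behave (the names, queue depth, and return values here are all invented for illustration): `comm_start` moves register values into a hardware queue in constant time, and `comm_wait` traps to the scheduler only if the transfer has not yet completed.

```python
# Hypothetical model of the coprocessor comms interface sketched above.
# comm_start: O(1), like a multiply -- just enqueue a descriptor.
# comm_wait: returns immediately if done, else "traps" so the kernel
# can suspend the calling thread.
from collections import deque

class CommCoprocessor:
    def __init__(self, queue_depth=4):        # depth is an assumption
        self.queue = deque()
        self.queue_depth = queue_depth
        self.done = set()

    def comm_start(self, chan, addr, length):
        """Single 'instruction': CPU registers -> hardware queue."""
        if len(self.queue) >= self.queue_depth:
            raise RuntimeError("queue full")  # would stall the CPU
        tag = (chan, addr, length)
        self.queue.append(tag)
        return tag

    def step(self):
        """The coprocessor makes progress asynchronously."""
        if self.queue:
            self.done.add(self.queue.popleft())

    def comm_wait(self, tag):
        """Second 'instruction': software trap if not complete."""
        if tag in self.done:
            return "completed"
        return "trap"   # kernel would deschedule the thread here

cop = CommCoprocessor()
t = cop.comm_start(chan=7, addr=0x1000, length=64)
assert cop.comm_wait(t) == "trap"   # not finished yet: thread waits
cop.step()                          # transfer completes in background
assert cop.comm_wait(t) == "completed"
```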
The x86_64 CPUs from Intel and AMD both have very high speed
inter-core interconnects, though I know little about them. They
are apparently good at bus snooping and therefore at transferring
short (cache-line = 64 .. 256 bytes) blocks of data very
quickly. It would be really interesting to know whether this
hardware could be targeted by specialised processor instructions
to make it useful for explicit communication, rather than implicit.
You're right about memory, of course, though of late the DDR line
of memory has been improving faster than it did for quite a while
(for about 10 years, performance improved by only about 20%, but
in the last 3 or 4 it has almost doubled). However, that doubled
performance is set against the 100- to 1000-fold performance
differential to the CPU. Recent
generations of memory interface have been trying to address this
with wider memory buses -- some Intel chips now have 4 parallel
channels of DDR4 memory, giving a 256-bit-wide interface -- but
although this improves overall bus throughput it does little for
the latency of any individual transfer. The famous "Spectre"
(etc.) bugs in chip design come about mostly from the efforts to
hide this latency from the core. The only way I've seen of
addressing the issue is to bring memory
on-chip... which is what High Bandwidth Memory (HBM) does. It puts
a large DRAM chip physically on top of the processor, with
parallel interconnections all across the plane. Thus both a wide
memory interface, and a much higher speed (short & small
wires) one can be achieved. The downside is that the size of
memory is limited to what can fit in one chip, and it also limits
processor power dissipation. No magic bullets then.

But it will take a lot of pressure to push
the software industry to change their methods to enable the
adoption of more parallel-friendly hardware, and I don't see that
happening at present. The current pressures on programming are all
aimed at reducing the cost to market, even at the expense of later
maintenance. Achieved software quality is (IMO) low, and getting
worse.
Best wishes, Ruth
On 06/12/2020 18:17, David May wrote:
Hi Ruth,
Ideally, a processor should be able to communicate
data as fast as it can process data. And ideally, this should
apply to small items of data, not just large ones. We’d like to
be able to off-load a procedure call to another processor, for
example. The transputer came close to this in 1985. Since then,
processor performance has increased by a factor of more than
1,000 (a combination of clock speed and instruction-level
parallelism), but communication performance hasn’t increased
much at all (except for very large data items). The issue seems
to be primarily the set-up and completion times - switch to
kernel, initialise DMA controller, de-schedule process …
interrupt, re-schedule process. However, there is also an issue
with chip-to-chip communication. If you build chips with lots of
processors - potentially hundreds with 5nm process technology -
the imbalance between on-chip processing performance and
inter-chip communication performance is vast (both latency and
throughput).
Incidentally, the problem with most current
architectures isn’t just communications - the memory systems
don’t work well either (see from slide 17 of above)!
It is entirely possible to fix the problems with
communication set-up and completion - one option is the
microprogrammed transputer-style implementation; another is the
multithreaded XCore style with single-cycle i/o instructions
(which means that threads can act as programmable DMA
controllers); yet another is to provide execution modes (with
dedicated registers) for complex instructions and interrupts -
this enables the process management to be done by software and
was first used on the Atlas computers of the 1960s (Normal
control, Extracode control, Interrupt control)!
All the best
David
David, Did you see my comments earlier, where I was
talking about, among other things, my ideas for a
modern transputer? I mentioned various points about
latency, and would be interested in your thoughts. I
have quoted the relevant part below in case not. I'm raising it again because you mentioned
the 1us latency (I presume this is the time to
indicate a communication is needed, not to complete
it, so no overlapping of this with other work). It
doesn't surprise me: most I/O subsystems seem to be
adequately optimised for bulk throughput but really
awful on startup latency.
On 04/12/2020 19:24, David
May wrote:
This issue - and the related issue of
processor-interconnect latency - is the main reason
that we’re not doing much parallel computing. The
interprocessor communication latency is still around
1 microsecond (the same as the transputer) for a short
message. As a result, many ‘supercomputers’ are just
used as clusters running scripts that launch lots of
small jobs.
Best wishes, Ruth
On 23 Nov 2020 23:35, Ruth Ivimey-Cook
wrote:
One thing I have been
contemplating for some time is what it would take
to make a modern transputer. I feel the critical
element of a design is provision of channel/link
hardware that has minimal setup time and DMA
driven access. I feel reducing latency, especially
setup time, requires a coprocessor-like interface
to the cpu, so that a single processor instruction
can initiate comms. If hardware access were
required over PCIe, for example, it would take
hundreds of processor cycles. Pushing that work
into a coprocessor enables the main cpu to get on
with other things, and maximises the chance that
the comms will happen as fast as possible.

The other side of the coin
would be to make the link engine(s)
essentially wormhole routers, as in the C104
router chips, complete with packet addressing.
Thus the link coprocessor would essentially become
some number of interfaces directly to the CPU plus
some number of interfaces to the external world,
with a crossbar in the middle. This would
massively increase the communications
effectiveness of the design, and while taking up
much more silicon area, I believe it would be an
overall benefit for any non-trivial system. One
net result is the elimination of the massive
amount of 'receive and pass on' routing code that
used to be needed with a directly connected link
design.
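As a toy illustration of why the routing code disappears (everything here is invented): with packet addressing, the crossbar can choose an output link from the packet header alone, so no 'receive and pass on' code ever runs on a CPU along the way.

```python
# Toy model of header-routed wormhole switching (names invented).
# Each packet carries its destination link in the first byte; the
# crossbar forwards the rest without any CPU involvement.

def route(packet, n_outputs):
    """Return (output_link, payload), chosen purely from the header."""
    dest, payload = packet[0], packet[1:]
    if not 0 <= dest < n_outputs:
        raise ValueError("bad header")
    return dest, payload

# A packet addressed to link 2, carrying the flits b"hi":
link, flits = route(bytes([2]) + b"hi", n_outputs=4)
assert link == 2 and flits == b"hi"
```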
The final element of the mix
would be to engineer the system such that software
virtualisation of links was standard -- as was
true on the transputer -- so code could think just
about performing communication, not about which
physical layer was involved, and also a way for
the link engine to raise an exception (e.g. sw
interrupt) to the processor if it cannot complete
the communication 'instantly', thus potentially
requiring a thread to suspend or be released.
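A sketch of what that virtualisation might look like to software (the API here is entirely invented): code names only a virtual channel, the link engine resolves it to a physical link, and an exception stands in for the software interrupt raised when the transfer cannot complete immediately.

```python
# Invented sketch of link virtualisation with a completion exception.

class LinkBusy(Exception):
    """Stands in for the sw interrupt raised by the link engine."""

class LinkEngine:
    def __init__(self, routes):
        self.routes = routes      # virtual channel -> physical link
        self.busy = set()         # links currently mid-transfer

    def send(self, vchan, data):
        link = self.routes[vchan]   # code never picks the physical layer
        if link in self.busy:
            # Cannot complete 'instantly': the kernel would now
            # suspend the sending thread until the link frees up.
            raise LinkBusy(vchan)
        self.busy.add(link)
        return ("sent", link, data)

eng = LinkEngine(routes={"to_worker": 0})
assert eng.send("to_worker", b"job")[1] == 0
try:
    eng.send("to_worker", b"job2")   # link still busy: thread suspends
except LinkBusy:
    pass
```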
I don't know, but from what I have seen so
far I don't think it is worth the complexity and
constraint of putting support for interleaved
threads into the processor hardware, as the transputer did,
but do feel it is valuable for the hardware to
provide appropriate hooks for a light threaded
kernel to do the job efficiently.