Re: Transistor count

On 8 Dec 2020, at 01:34, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

Ruth and David,

I am trying to follow your discussion here, but let me cut to the chase with the implication that communication setup time has stalled at 1 us and not improved since the Transputer. Measured in cycles, that seems to mean communication setup time has INCREASED from 20 cycles to about 2000 cycles. This problem is not new. I remember one of the occam porting projects, to (I believe) the PowerPC (I can't find the reference), actually tackled external communication, but the overhead was about 700 cycles if I remember right.
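The cycle counts above can be checked with a short sketch. It assumes a 20 MHz transputer clock (so that 20 cycles take 1 us) and a 2 GHz modern clock. The function name is made up for illustration, and integer units are used to avoid rounding.

```python
def setup_cycles(setup_time_ns: int, clock_mhz: int) -> int:
    """Number of clock cycles that elapse during a fixed setup time."""
    # ns * MHz gives cycles * 1000, so divide by 1000 to get whole cycles.
    return setup_time_ns * clock_mhz // 1000


SETUP_TIME_NS = 1000  # 1 us communication setup time, as stated above

transputer_cycles = setup_cycles(SETUP_TIME_NS, 20)   # assumed 20 MHz transputer
modern_cycles = setup_cycles(SETUP_TIME_NS, 2000)     # assumed 2 GHz processor

# Prints: 20 2000 100
print(transputer_cycles, modern_cycles, modern_cycles // transputer_cycles)
```

The last figure is the factor of 100 discussed below.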

Fixing the problems (as David says) that slow you down by a factor of 100 certainly seems to make sense. I wonder what the physical limits are if you try to make a 2GHz Transputer?

What do you mean by a 2 GHz transputer? The same configuration (RAM/links) as, e.g., a T800? 100x the performance (i.e. the time to do a multiply is 100x shorter)?

Although 100x sounds like it should be easy, there are some things to take into account.

The transputer design used 4-phase clocking - much more happened in a single clock cycle than happens in a modern 2-phase design. In other words, you might consider the design to be effectively 2x or so faster than a modern 20 MHz design. So we should be thinking about 200x for the circuitry, i.e. the equivalent of 4 GHz. 4 GHz is fast, but the design is small and simple, so it should work.

In terms of the places where there might be problems in the microarchitecture, the key thing would be to execute ldnl in 1 ns (2 x 1/(2 GHz)). This is doable; the Apple A14 does better than this: it manages to perform a similar access to a 128 KB L1 data cache with 3-cycle latency at 3 GHz (i.e. 1 ns). So I think it must be the case that the basic processor/RAM system can be built.
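As a rough check of the two 1 ns figures, here is a minimal sketch. It reads the "2 x" above as a 2-cycle ldnl, which is an assumption, and the helper name is hypothetical. Fractions keep the arithmetic exact.

```python
from fractions import Fraction


def latency_ns(cycles: int, clock_ghz: int) -> Fraction:
    """Latency in nanoseconds of `cycles` clock cycles at `clock_ghz` GHz."""
    # One cycle at f GHz lasts 1/f ns.
    return Fraction(cycles, clock_ghz)


ldnl_target = latency_ns(2, 2)  # assumed 2-cycle ldnl at 2 GHz
a14_l1 = latency_ns(3, 3)       # 3-cycle L1 access at 3 GHz, figures from above

# Prints: 1 1
print(ldnl_target, a14_l1)
```

Both come out at 1 ns, so the ldnl budget is in the same range as the cache access quoted for the A14.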

I think the system clocking and comms system would take some work. A 2 GHz link for local on-chip communication could be doable - we assume you have a square grid of transputers with local connections. The challenge would be keeping the latency as low - and hence the throughput as high - as it is in the transputer design. Where we communicate between different clock regimes we will incur delays (some number of cycles - and remember an ack packet is only two (link) cycles long).

Speed of light for 20 cycles would still give you 3 meters.
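The 3 meter figure can be reproduced with a small sketch. It assumes the vacuum speed of light and a 2 GHz clock; the function name is illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # vacuum speed of light


def light_distance_m(cycles: int, clock_hz: int) -> float:
    """Distance light travels in vacuum during `cycles` clock cycles."""
    return SPEED_OF_LIGHT_M_PER_S * cycles / clock_hz


distance = light_distance_m(20, 2_000_000_000)  # 20 cycles at an assumed 2 GHz

# Prints: 3.00 m
print(f"{distance:.2f} m")
```

This is an upper bound only: it says nothing about how fast signals travel in real wiring.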

Speed of light isn’t the issue for circuits on silicon. Signals propagate much more slowly than this on-chip.

[For large systems the problem is that the techniques used to build high-bandwidth systems incur very large latencies - you can build very high bandwidth long-distance interconnects but you face a latency problem.]

I know this is crude extrapolation, but . . . I wonder if the real problem is that people have given up on massively parallel programming (except for GPUs).

I think the problem is that there is little work tackling the combined problem of massively parallel programming and computer architecture together.

Roger

Larry