[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Transistor count



Roger,

I THINK I understand what you mean by high valency, something like a select() over a large number of channels (or parallel output to a large number of channels), but it simply is not required if you can accept logarithmic delays. For instance, a triangular array has 90,000 leaf nodes (300 squared), but the fourth link on each leaf connects to a twig, three leaves to a twig, and the fourth link on each twig connects to a branchlet, etc. 45,005 gatherers stream all the leaves' data to one point in 11 steps and nobody has a fanout of more than 4, or needs to communicate with anything distant.

Of course I am ignoring data throughput questions, but they can be dealt with using extra gatherers without increasing the logarithmic latency. It scales every day to large problems - check out your high-resolution GPU.

As for 100 nodes sharing read-only code, given an instruction cache on each node, it ought to be doable with FIFOs. Has anyone really tried? It seems to be no different from lots of processes accessing a solid state disk, and we do that all the time.

Larry

On Nov 24, 2020, at 2:37 PM, Roger Shepherd <rog@xxxxxxxx> wrote:

> Larry
> 
>> On 24 Nov 2020, at 16:56, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:
>> 
>> ======
>> I also think the 4 links is not enough.  6 should be a minimum for 3D simulations and 10-20 would be better more up to date unstructured problems.
>> ======
>> 
>> I did some work in 1996 on classic Transputers, using Forrest Crowell and Neal Elzenga's published measurements, that indicated you could go to 8 links full speed both ways and still be running CPU calculations at 48% efficiency between the DMA cycle stealing (see my Nov 19 9:55 AM PST message on this thread). That does not get you to 20, but hardware can be redesigned if link count increase is needed, e.g. by using 16 or 32 byte "words" in your DMA. By the way, you can get away with 4 in 3D if you have to (five tetrahedra make a cube), but it is clumsy.
> 
> You need high fanout (high-valence) for achieving general purpose connectivity. But you don’t need high valance to handle the bandwidth from a single processing engine. That’s (one reason) why the second generation transputer architecture split compute nodes and communication nodes.
> 
> You WILL run into problems with any system with limited connectivity - it doesn’t scale to large problems.
> 
> Roger
> 
>> Larry
> 
> 
> 
>