
Re: Transistor count



Larry,

On 25 Nov 2020, at 01:19, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

Roger,

I THINK I understand what you mean by high valency - something like a select() over a large number of channels (or parallel output to a large number of channels) - but it simply is not required if you can accept logarithmic delays. For instance, a triangular array has 90,000 leaf nodes (300 squared). The fourth link on each leaf connects to a twig (three leaves per twig), the fourth link on each twig connects to a branchlet, and so on. 45,005 gatherers stream all the leaves' data to one point in 11 steps, and nobody has a fanout of more than 4 or needs to communicate with anything distant.
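A minimal C sketch of the count, assuming a fan-in of 3 at every gathering stage and rounding up where a level does not divide evenly by 3 (it reproduces the 45,005 gatherers and 11 steps above):

#include <stdio.h>

int main(void)
{
    long nodes = 300L * 300L;      /* 90,000 leaf nodes */
    long gatherers = 0;
    int steps = 0;

    while (nodes > 1) {
        nodes = (nodes + 2) / 3;   /* ceil(nodes/3) gatherers at this level */
        gatherers += nodes;
        steps++;
    }
    printf("gatherers = %ld, steps = %d\n", gatherers, steps);
    /* prints: gatherers = 45005, steps = 11 */
    return 0;
}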

I don’t understand. I’m missing the model of computation you are thinking of. I believe that interesting problems don’t map onto simple structures, and the only way to make those problems programmable is to support an abstract model of programming in which you can address problems with non-local communication. Proposals were made around 1990 for how to do this and how to hide any mismatch between the physical connectivity of the machine and the logical connectivity of a program. There’s also a lot of understanding of how communication networks behave and what we need to do to make them effective. Low connectivity leads to longer latency, which leads to a requirement for a high degree of excess parallelism, which increases the problem size you need before parallelism pays off. Or, to put it another way, it limits the benefit of parallelism for small problems. This is a practical, engineering issue.
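To put a number on the latency point: by Little’s law, to keep a processor busy while a communication of latency L cycles is outstanding, with g cycles of useful work per message, you need roughly L/g concurrent processes per node. A hedged sketch in C - the latency and grain figures are illustrative assumptions, not measurements:

#include <stdio.h>

int main(void)
{
    double latency_cycles = 1000.0;   /* assumed remote round-trip latency */
    double grain_cycles   = 50.0;     /* assumed useful work per message   */

    /* Little's law: concurrency = latency x throughput. */
    printf("excess parallelism needed ~ %.0f processes per node\n",
           latency_cycles / grain_cycles);
    return 0;
}

Doubling the latency doubles the slack you need, which is why low connectivity raises the minimum problem size at which parallelism pays.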

Of course I am ignoring data throughput questions, but those can be dealt with by adding extra gatherers without increasing the logarithmic latency. This kind of structure scales to large problems every day - check out your high-resolution GPU.

As for 100 nodes sharing read-only code, given an instruction cache on each node, it ought to be doable with FIFOs. Has anyone really tried? It seems to be no different from lots of processes accessing a solid state disk, and we do that all the time.
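For concreteness, a single-threaded C sketch of the shape of that scheme: one shared read-only code store serving nodes that each keep a small instruction cache, with the request/reply channel reduced to a direct function call. The line size, store size, and names are illustrative assumptions:

#include <stdio.h>
#include <string.h>

#define LINE_WORDS 8        /* assumed cache-line size, in words */
#define STORE_WORDS 1024

static unsigned code_store[STORE_WORDS];   /* the shared read-only image */

/* "Server" side: answer one request for the line containing addr. */
static void serve_line(unsigned addr, unsigned line[LINE_WORDS])
{
    unsigned base = addr & ~(LINE_WORDS - 1u);
    memcpy(line, &code_store[base], sizeof(unsigned) * LINE_WORDS);
}

/* "Node" side: one node's one-line cache; on a miss, send a request. */
static unsigned fetch(unsigned addr)
{
    static unsigned cached_base = ~0u;
    static unsigned cache[LINE_WORDS];
    unsigned base = addr & ~(LINE_WORDS - 1u);
    if (base != cached_base) {          /* miss: go over the "link" */
        serve_line(addr, cache);
        cached_base = base;
    }
    return cache[addr - base];
}

int main(void)
{
    for (unsigned i = 0; i < STORE_WORDS; i++)
        code_store[i] = i * i;          /* stand-in for code words */
    printf("fetch(5) = %u, fetch(6) = %u (second hits the cache)\n",
           fetch(5), fetch(6));
    return 0;
}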

When do we do this? Sure, there can be a lot of operating-system processes accessing a disk, but they absolutely aren’t running at the same time - and if you can happily time-multiplex a processor, why replicate it?

Roger


Larry

On Nov 24, 2020, at 2:37 PM, Roger Shepherd <rog@xxxxxxxx> wrote:

Larry

On 24 Nov 2020, at 16:56, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

======
I also think that 4 links is not enough. 6 should be a minimum for 3D simulations, and 10-20 would be better for more up-to-date unstructured problems.
======

I did some work in 1996 on classic Transputers, using Forrest Crowell and Neal Elzenga's published measurements, which indicated you could run 8 links at full speed in both directions and still have the CPU calculating at 48% efficiency after allowing for the DMA cycle stealing (see my Nov 19 9:55 AM PST message on this thread). That does not get you to 20, but the hardware can be redesigned if a higher link count is needed, e.g. by using 16- or 32-byte "words" in the DMA. By the way, you can get away with 4 links in 3D if you have to (five tetrahedra make a cube), but it is clumsy.
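The shape of that cycle-stealing calculation can be written down as a hedged C sketch. The link rate, word size, and memory-cycle time below are assumed round numbers in the classic-transputer range, not the Crowell/Elzenga measurements, so it lands near 66% rather than the measured 48%:

#include <stdio.h>

int main(void)
{
    /* Assumption: every word a link engine moves steals one memory
       cycle from the CPU. All figures below are placeholders. */
    double link_bytes_per_s = 1.7e6;  /* assumed rate per link direction */
    int    links = 8, directions = 2;
    double word_bytes = 4.0;
    double mem_cycle_s = 50e-9;       /* assumed 50 ns memory cycle */

    double words_per_s = links * directions * link_bytes_per_s / word_bytes;
    double stolen = words_per_s * mem_cycle_s;   /* fraction of cycles lost */
    printf("CPU efficiency ~ %.0f%%\n", 100.0 * (1.0 - stolen));
    return 0;
}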

You need high fanout (high valency) to achieve general-purpose connectivity. But you don’t need high valency to handle the bandwidth from a single processing engine. That’s (one reason) why the second-generation transputer architecture split compute nodes and communication nodes.

You WILL run into problems with any system with limited connectivity - it doesn’t scale to large problems.
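The hop-count geometry makes the point. This C sketch compares the worst-case path length of a valency-4 2D mesh with that of an indirect network built from k-valent routers; k = 32 is an assumption, roughly C104 territory:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double sizes[] = { 1e2, 1e4, 1e6 };
    int k = 32;                      /* assumed router valency */
    for (int i = 0; i < 3; i++) {
        double n = sizes[i];
        double mesh = 2.0 * (sqrt(n) - 1.0);            /* corner to corner */
        double indirect = 2.0 * ceil(log(n) / log(k));  /* up and back down */
        printf("N=%8.0f  mesh diameter=%5.0f hops  indirect ~%2.0f hops\n",
               n, mesh, indirect);
    }
    return 0;
}

The mesh diameter grows with the square root of the machine size; the indirect network grows with its logarithm.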

Roger

Larry

--
Roger Shepherd



