
Re: Transistor count



The work goes on while we sleep in California . . .

My simple notion obviously creates an agenda for additional work on some basic problems. Some comments on Roger's questions:

On Nov 25, 2020, at 2:00 AM, Roger Shepherd <rog@xxxxxxxx> wrote:

Larry,

On 25 Nov 2020, at 01:19, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

Roger,

I THINK I understand what you mean by high valency, something like a select() over a large number of channels (or parallel output to a large number of channels), but it simply is not required if you can accept logarithmic delays. For instance, a triangular array has 90,000 leaf nodes (300 squared), but the fourth link on each leaf connects to a twig, three leaves to a twig, and the fourth link on each twig connects to a branchlet, etc. 45,005 gatherers stream all the leaves' data to one point in 11 steps and nobody has a fanout of more than 4, or needs to communicate with anything distant.
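
To make the shape concrete, here is a minimal sketch in Go, with channels standing in for links. The fanout of 3, the 90,000 leaves, the 45,005 gatherers, and the 11 steps come from the paragraph above; the reduction (a sum) and all the names are just illustrative.

// gather.go: the fanout-3 gather tree sketched above, with Go channels
// standing in for transputer links. The fanout, leaf count, gatherer
// count, and step count come from the message; the reduction (a sum)
// and all names are illustrative.
package main

import "fmt"

// gatherer reads one value from each child link and forwards the sum on
// its parent link. No node has a fanout above 4 (3 children + 1 parent),
// and none communicates with anything distant.
func gatherer(children []chan int, parent chan int) {
	sum := 0
	for _, c := range children {
		sum += <-c
	}
	parent <- sum
}

// buildLevel wires one level of the tree: each group of up to 3 channels
// feeds a gatherer, which emits on a single new channel.
func buildLevel(inputs []chan int) []chan int {
	var outputs []chan int
	for i := 0; i < len(inputs); i += 3 {
		end := i + 3
		if end > len(inputs) {
			end = len(inputs)
		}
		parent := make(chan int, 1)
		go gatherer(inputs[i:end], parent)
		outputs = append(outputs, parent)
	}
	return outputs
}

func main() {
	const leaves = 90000 // 300 squared

	// Each leaf emits 1, so the final gathered value should be 90000.
	level := make([]chan int, leaves)
	for i := range level {
		level[i] = make(chan int, 1)
		level[i] <- 1
	}

	steps, gatherers := 0, 0
	for len(level) > 1 {
		level = buildLevel(level)
		gatherers += len(level)
		steps++
	}
	// Expect: result=90000 steps=11 gatherers=45005
	fmt.Printf("result=%d steps=%d gatherers=%d\n", <-level[0], steps, gatherers)
}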

I don’t understand. I’m missing the model of computation you are thinking of. I believe that interesting problems don’t map on to simple structures

I believe most of them do - climate modeling, classical physics of structures, AI, fluid flow . . . all local. Others have brought up quantum problems which do have nontrivial action at a distance, but that is just one topic of many. Whether "local" = "simple structures" is of course subject to dispute, but I think with some creativity it can be made to work. (One of the "basic problems" . . . )

and the only way to make those problems programmable is to support an abstract model of programming where you’re able to address problems with non-local communication.

That amounts to accepting defeat in my opinion. It requires centralization of everything and reduces effectiveness by multiple orders of magnitude. (Compare (clock speed)*(transistor count) increase since, say, 1985, to the increase in what computers can actually do. Almost all the advances have been in raw repetition, like screen resolution and data transmission speeds - totally "simple structures".) 

Proposals were made around 1990 for how to do this and how to hide any mismatch between the physical connectivity of the world and the logical connectivity of a program. There's also a lot of understanding of how communication networks behave and what we need to do to make them effective. Low connectivity leads to longer latency, which leads to a requirement for a high degree of excess parallelism, which in turn increases the problem size you need before parallelism pays off. Or to put it another way, it limits the benefit of parallelism for small problems. This is a practical, engineering issue.

Agreed, but mostly people have given up rather than attempting to solve it. The exception is GPUs (and maybe AI?), and GPUs have been pretty successful. The alternative, as you say, is "an abstract model of programming", which has imposed something like a fifth-root law on our computing progress (i.e. resources increase by 100,000 and usable power increases by 10). Breaking out of this trap deserves some effort.


Of course I am ignoring data throughput questions, but they can be dealt with using extra gatherers without increasing the logarithmic latency. This kind of structure scales to large problems every day - check out your high-resolution GPU.

As for 100 nodes sharing read-only code, given an instruction cache on each node, it ought to be doable with FIFOs. Has anyone really tried? It seems to be no different from lots of processes accessing a solid state disk, and we do that all the time.

When do we do this? Sure, there can be a lot of operating system processes accessing a disk, but they absolutely aren't running at the same time - and if you can happily time-multiplex a processor, why replicate it?

It is purely a question of resources. I am thinking of "big code, small data" problems (data means alterable data), into which class many scientific problems fall.

Here's why I believe it is a soluble problem. Think of 100 people at a conference watching a slide show on a big screen. They all view simultaneously, and can focus on the same part of the slide, or different parts, or both, with no interference. We all take this for granted, but from an information-flow point of view it is interesting: a lot of photons are flying. My notion is to do something analogous with read-only memory, and we have quite a bit of resource to work with - say 30% of the local resources of 100 processors. The idea is to make it happen very fast with resources that are small per processor, but may be considerable when totaled over 100 processors.
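
As a thought experiment, here is a minimal sketch of that in Go: one read-only code store feeding 100 nodes over FIFO-like channels, each node with a small instruction cache. The node count and the FIFO idea come from the discussion above; the request/reply protocol, the block size, and every name are my illustrative assumptions, not a real transputer design.

// codeshare.go: a sketch of 100 nodes running from one read-only code
// image over FIFOs, each with a small instruction cache. The node count
// and the FIFO idea come from the discussion; the request/reply protocol,
// block size, and all names are illustrative assumptions.
package main

import (
	"fmt"
	"sync"
)

const (
	nodes     = 100
	blockSize = 64 // bytes per code block - an arbitrary choice
)

// request asks the store for one block; the reply comes back on a
// private FIFO belonging to the requesting node.
type request struct {
	block int
	reply chan [blockSize]byte
}

// store serves the read-only image. Because the code never changes,
// requests need no locking or ordering: blocks stream out on demand.
func store(image [][blockSize]byte, requests <-chan request) {
	for req := range requests {
		req.reply <- image[req.block]
	}
}

// node models one processor. On a cache miss it queues a request on the
// shared FIFO; on a hit it generates no traffic at all.
func node(requests chan<- request, wg *sync.WaitGroup) {
	defer wg.Done()
	cache := map[int][blockSize]byte{}
	reply := make(chan [blockSize]byte)
	fetch := func(block int) [blockSize]byte {
		if b, ok := cache[block]; ok {
			return b
		}
		requests <- request{block, reply}
		b := <-reply
		cache[block] = b
		return b
	}
	// All nodes walk the same "program": they may focus on the same
	// block or different blocks without interfering - the slide show.
	for pc := 0; pc < 16; pc++ {
		_ = fetch(pc % 8)
	}
}

func main() {
	image := make([][blockSize]byte, 8) // 8 blocks of read-only "code"
	requests := make(chan request, nodes)
	go store(image, requests)

	var wg sync.WaitGroup
	for i := 0; i < nodes; i++ {
		wg.Add(1)
		go node(requests, &wg)
	}
	wg.Wait()
	fmt.Println("all", nodes, "nodes ran from the shared read-only image")
}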

Larry


Roger


Larry

On Nov 24, 2020, at 2:37 PM, Roger Shepherd <rog@xxxxxxxx> wrote:

Larry

On 24 Nov 2020, at 16:56, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

======
I also think that 4 links is not enough. 6 should be a minimum for 3D simulations, and 10-20 would be better for more up-to-date unstructured problems.
======

I did some work in 1996 on classic Transputers, using Forrest Crowell and Neal Elzenga's published measurements, which indicated you could run 8 links at full speed both ways and still be running CPU calculations at 48% efficiency between the DMA cycle stealing (see my Nov 19 9:55 AM PST message on this thread). That does not get you to 20, but hardware can be redesigned if a link count increase is needed, e.g. by using 16- or 32-byte "words" in your DMA. By the way, you can get away with 4 links in 3D if you have to (five tetrahedra make a cube), but it is clumsy.
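
For what it's worth, here is that back-of-envelope arithmetic as a tiny Go program. The only data point taken from the measurements above is 8 links at 48% CPU efficiency; the linear cycle-stealing model, the assumption that the 1996 figure used 4-byte DMA words, and the proportional saving from wider words are my assumptions, so treat the output as a sketch, not a prediction.

// links.go: back-of-envelope arithmetic for DMA cycle stealing. The model
// (each active link direction steals a fixed fraction of memory cycles,
// scaling linearly, and wider DMA "words" cut the steal rate in
// proportion) is my assumption; only the 8-links-at-48% data point
// comes from the message.
package main

import "fmt"

func main() {
	// Calibrate: 8 links, both directions active, leaves 48% for the CPU.
	const calibLinks, calibEff = 8, 0.48
	stealPerDir := (1 - calibEff) / (2 * calibLinks) // about 3.25% per direction

	for _, links := range []int{4, 8, 20} {
		for _, wordBytes := range []int{4, 16, 32} {
			// Assume the 1996 figure used 4-byte words; wider words
			// move the same bandwidth in fewer stolen cycles.
			steal := stealPerDir * 4 / float64(wordBytes)
			eff := 1 - float64(2*links)*steal
			if eff < 0 {
				eff = 0 // DMA saturates the memory: no CPU cycles left
			}
			fmt.Printf("%2d links, %2d-byte words: CPU efficiency %4.0f%%\n",
				links, wordBytes, eff*100)
		}
	}
}

On this model 20 links with 4-byte words would leave nothing for the CPU, but 16- or 32-byte words bring it back to roughly 68% and 84%, which is the sense in which redesigned DMA hardware could support the higher link counts.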

You need high fanout (high valency) to achieve general-purpose connectivity. But you don't need high valency to handle the bandwidth from a single processing engine. That's (one reason) why the second-generation transputer architecture split compute nodes and communication nodes.

You WILL run into problems with any system with limited connectivity - it doesn’t scale to large problems.

Roger

Larry

--
Roger Shepherd