Roger and all,

Clearly I am getting into the depths of this kind of design . . . maybe OUT OF my depth ;-) but I don't know yet . . . there are some commonly made assumptions that simply do not have to hold.

On Nov 23, 2020, at 1:43 PM, Roger Shepherd <rog@xxxxxxxx> wrote:

Larry
On 23 Nov 2020, at 16:04, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:
Thank you, Roger, for all the valuable considerations.
On Nov 23, 2020, at 7:06 AM, Roger Shepherd <rog@xxxxxxxx> wrote:
Larry
On 22 Nov 2020, at 23:19, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:
Following up on Tony's numbers with a reference to Uwe's numbers and Denis's poster and Roger's comment -
Starting with what we figured before, say 100,000 for a T800 minus memory plus 50,000/KB for 4 KB memory, recalculating for Tony's 32KB gives 1,700,000 per Transputer (almost all dedicated to memory), and 10000 of them is 17B transistors - short of Nvidia's 54B (are we missing a third dimension here?). However, Uwe suggests much more weight for the links (his version dedicates half its LUTs to links, while the other consensus was less than a third, even if not counting memory), and we hear from Roger, commenting on Denis's poster, that link transistors are physically BIGGER than others.
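The arithmetic above can be checked with a quick sketch (the 100,000-logic-transistor and 50,000-per-KB figures are the thread's earlier estimates, not measured values):

```python
# Back-of-envelope check of the transistor counts quoted above.
logic_transistors = 100_000      # T800 minus memory (thread's earlier estimate)
transistors_per_kb = 50_000      # per KB of on-chip RAM (thread's earlier estimate)
memory_kb = 32                   # Tony's 32 KB figure

per_transputer = logic_transistors + transistors_per_kb * memory_kb
total = per_transputer * 10_000  # 10,000 transputers

print(per_transputer)  # 1,700,000 - almost all of it memory
print(total)           # 17,000,000,000 - short of Nvidia's 54B
```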
My comment was not about the size of the transistors; it was about the density of transistors within the real estate dedicated to links. I don’t *know* why this is but it is likely that the density is limited by wiring. This is a perennial problem - the manufacturing process used for the transputer had (from memory) 2 layers of interconnect - 1 metal and 1 polysilicon. Modern processes have a lot more capability - perhaps 10 layers of wiring. In practice, even with this, logic density is limited by interconnect, not by transistors. “Transistors are free” isn’t quite true, but they aren’t the critical resource in modern designs. Again, the constraints on interconnect are such that local (same clock domain) is cheap and fast, non-local is expensive and slow.
Purely distributed design, like networks of Transputers, then has a big advantage IN PRINCIPLE.
N.B. Throughput is cheap - “just” go parallel, it’s a matter of economics. Latency is hard, you’re up against the laws of physics.
It shouldn't necessarily be that bad - IN PRINCIPLE, again, it's a "logarithm" not a "square root" - you use hyperbolic geometry (not Euclidean geometry) for your network. This can be done with as few as three links per node (triangular grid, but more than six triangles around a point - three, four, or five triangles around a point gives you a Platonic solid, six gives you the Euclidean plane, more than six gives you hyperbolic geometry). Of course you have to fit them on the die somehow ;-)
You need high fanout to avoid too many intermediate steps.
Not sure what you mean here. Doesn't fanout mean you are driving them all at once? But even that would follow a log law. Euclidean disk area pi*r^2, if area is 100,000 diameter is ~357, but hyperbolic disk area pi*(2sinh(r/2))^2, if area is 100,000 diameter is ~21. You have lots of link engines but few per node.
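The disk-area comparison above can be reproduced directly (a sketch of the arithmetic only; "diameter" here is 2r in the respective metric):

```python
import math

def euclid_diameter(area):
    # Euclidean disk: A = pi * r^2  ->  r = sqrt(A / pi)
    return 2 * math.sqrt(area / math.pi)

def hyperbolic_diameter(area):
    # Hyperbolic disk: A = pi * (2 sinh(r/2))^2  ->  r = 2 asinh(sqrt(A/pi) / 2)
    return 2 * 2 * math.asinh(math.sqrt(area / math.pi) / 2)

print(round(euclid_diameter(100_000)))      # ~357
print(round(hyperbolic_diameter(100_000)))  # ~21
```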
I meant that the communication nodes should have “high valency”. That is, one input should be switchable to one of a large number of outputs. High valency decreases the number of hops needed, which reduces the latency of communication, which decreases the amount of “excess parallelism” needed, which reduces the minimum size of problem you can usefully parallelise.
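As a rough model (my assumption here, not a formula from the thread): if each hop can fan out to `valency` neighbours, reaching N nodes takes about log-base-valency of N hops, so raising valency cuts the worst-case hop count sharply.

```python
import math

def hops_needed(nodes, valency):
    # Rough tree-routing model: each hop multiplies the reachable
    # set by the valency, so hops ~ ceil(log_valency(nodes)).
    return math.ceil(math.log(nodes) / math.log(valency))

print(hops_needed(10_000, 4))   # 7 hops at valency 4
print(hops_needed(10_000, 32))  # 3 hops at valency 32
```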
And some more transistors are needed - I don't know how many - for Tony's extra interconnects. The fact that geometry will be made simply repetitive will help, and distributed resources helps immensely.
So we have a bunch of variables here. But by backing off the 32KB memory (need analysis of use cases here) we get lots more Transputers, almost six times as many. When I say "use case" I think climate modeling is a good thought exercise. And it's true that Nvidia claims 5 petaflops, which works out to 500 gigaflops per core, which seems high. But these are all very creative questions.
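The 500-gigaflops-per-core figure follows from dividing the claimed aggregate by the 10,000-node count used earlier in the thread:

```python
claimed_flops = 5e15   # Nvidia's claimed 5 petaflops
cores = 10_000         # the node count assumed above

per_core = claimed_flops / cores
print(per_core)        # 5e11, i.e. 500 gigaflops per core - which does seem high
```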
“Use case” really matters. We know that for some domains, specialised machines (GPUs, ML/AI-PUs) do very well. You have to get the balance between compute, store and communication right. If your processor is underpowered you have to use too many of them, causing you problems with storage and communication. Remember, the program has to be stored locally, and so requiring double the number of processors because of limited computation capability means twice the store dedicated to program.
I think we can get smart here by giving some ground on "locally". Remember, in CSP any number of processes can share READ-ONLY memory, so you can have a sequence of "loading state" and "running state" (like the Transputer worm), and during running state a big block of read-only memory with the code is shared by, say, 100 nodes (each running the same, or almost the same, program). This requires a bit of design attention, because computer science says "any number read in parallel" but in the real world some sequences are involved.
Don’t believe this would work for independent processors. If they operate in a SIMD-like manner then maybe - but then you carry the problem of every processor displaying worst-case behaviour. You can’t share memory - that’s why processors are typically tightly coupled to their I-caches.
I am assuming the 100 or so that share read-only memory are on the same clock. This would be a NUMA setup of a very simple kind. Blocks of 100 would communicate with nearest-neighbor blocks only and via channel semantics. If hyperbolic grid is used, lots of nearest neighbors, it is true - tradeoff with short diameter. Read-only cache design would be part of it; FIFOs not so bad for writes because node neighbor channels only.
It’s hard to build a memory that isn’t saturated by one processor. That’s why modern machines have separate I and D caches. 100 processors sharing a memory sounds tough.
Use cases would be at the center because there would be a manufacturing process that cheaply varies parameters to create a chip optimized for any given use case.
You’ll also need more communication capability to deal with the number of processors. It’s absolutely the case that the transputer processor is underpowered by today’s standards - I don’t know by how much.
I wonder if this is true - if you analyze it in time units of clock cycles per single core. I don't think it is true, if you analyze it in clock cycles per million transistors.
What are you trying to do? If you are trying to get 2 processors to solve a problem faster than 1 at a reasonable cost - then you need to be using your resources quite well - one limited resource is RAM - you probably need to work it hard. 32 cycles for a 32-bit multiply doesn’t cut it against a 1-multiply-per-cycle processor. The thing is, certain useful functions (multiplication) are pretty cheap compared with the cost of a processor; not being reasonably competitive on performance means you need too many processors to solve the problem. Now exactly how much processing you need to get a balanced system I do not know - but faster processors mean fewer of them, which means less communication infrastructure and less total RAM.
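Roger's multiplier point can be put in numbers (a hedged sketch; the cycle counts are illustrative, taken from the 32-cycle figure above):

```python
slow_mul_cycles = 32   # transputer-style sequential 32-bit multiply
fast_mul_cycles = 1    # pipelined 1-multiply-per-cycle core

# For multiply-bound work at the same clock, the slow core needs this
# many times as many processors for the same throughput - and each
# extra processor brings its own program store and link infrastructure.
processor_ratio = slow_mul_cycles // fast_mul_cycles
print(processor_ratio)  # 32
```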
We can do use-case-based analysis and see what needs extra capability, in some or all processors, according to the formula you are alluding to. Lots of times it is just Linpack. My guess (hope) is that for lots of science, per-node writable memory is rather small. If the use case is unfriendly, everything talks to everything else all the time, then probably no go. But climate modeling and other physics are pretty local. The massively parallel approach is natural for fine grids - and climate models stumble because grids are too coarse.
I think the models being used in these areas have moved on from being based on simple grids...
That is where design on the program level comes in. Hyperbolic networks can be used for comms, though path length matters, and some fancy footwork to avoid echoes. Again, use cases matter. My notion only needs about 100 nodes to be synchronous,
Do you mean synchronous? Same clock? Pretty tough if they are fast nodes.
I mean the 100 nodes are synchronous to each other but NUMA to other (neighbor) blocks of nodes (see above). The idea is to slim things way down, as the use case allows. Repetitive, local, no "long reach" stuff - simplifies manufacturing too, allowing varied parameters.
100 nodes synchronous means they are each very small or very slow.
but maybe could go to 100,000 or so (conceptually) with a little work on a die.
Larry
Roger
-- Roger Shepherd rog@xxxxxxxx