
RE: Transistor count



Hi all

 

What I noticed in some of the HPC work I have been on the periphery of is that some of the ideas from the late 1980s and early 1990s are cropping up again.

 

On the software side, one interesting development I came across was WebAssembly – a bit like ANDF resurrected, but now with a programmer/user pull rather than a standards push. The guy who developed Docker containers said that if it had been around earlier, he would not have needed to invent containers.

 

Roger’s comment about bulk synchronous programming reminds me of the BSP work in occam all those years ago.

 

No matter how many programming models I look at, I still don’t see any that have the simplicity of the CSP model, but then I am a simple person 😊

 

I occasionally wonder what a 21st century occam would look like.
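For what it’s worth, Go’s channels are one mainstream descendant of the CSP model. A minimal sketch of an occam-style PAR of communicating processes, using Go as a stand-in (illustrative only, not a proposal for what a new occam should look like):

```go
package main

import "fmt"

// An occam-ish process: read values from in, double them, send on out.
func double(in <-chan int, out chan<- int) {
	for v := range in {
		out <- 2 * v
	}
	close(out)
}

func main() {
	in := make(chan int)  // unbuffered, like an occam channel: a rendezvous
	out := make(chan int)

	go double(in, out) // roughly PAR: run the process alongside main

	go func() {
		for i := 1; i <= 3; i++ {
			in <- i
		}
		close(in)
	}()

	for v := range out {
		fmt.Println(v) // 2, 4, 6
	}
}
```

The unbuffered channels give the same synchronised, point-to-point communication as occam channels, though Go of course lacks occam’s static checking of channel usage.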

 

 

Tony Gore

 

Aspen Enterprises Limited email  tony@xxxxxxxxxxxx

tel +44-1278-769008  GSM +44-7768-598570 URL:

 

Registered in England and Wales no. 3055963 Reg.Office Aspen House, Burton Row, Brent Knoll, Somerset TA9 4BW.  UK

 

 

 

From: Roger Shepherd <rog@xxxxxxxx>
Sent: 23 November 2020 15:06
To: Larry Dickson <tjoccam@xxxxxxxxxxx>
Cc: Tony Gore <tony@xxxxxxxxxxxx>; Uwe Mielke <uwe.mielke@xxxxxxxxxxx>; Denis A Nicole <dan@xxxxxxxxxxxxxxx>; Øyvind Teig <oyvind.teig@xxxxxxxxxxx>; David May <David.May@xxxxxxxxxxxxx>; occam-com <occam-com@xxxxxxxxxx>; Michael Bruestle <michael_bruestle@xxxxxxxxx>; Transputer TRAM <claus.meder@xxxxxxxxxxxxxx>
Subject: Re: Transistor count

 

Larry



On 22 Nov 2020, at 23:19, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

 

Following up on Tony's numbers with a reference to Uwe's numbers and Denis's poster and Roger's comment -

 

Starting with what we figured before - say 100,000 for a T800 minus memory, plus 50,000/KB (previously figured for 4 KB of memory) - recalculating for Tony's 32 KB gives 1,700,000 per Transputer (almost all dedicated to memory), and 10,000 of them is 17B transistors - short of Nvidia's 54B (are we missing a third dimension here?). However, Uwe suggests much more weight for the links (his version dedicates half its LUTs to links, while the other consensus was less than a third, even without counting memory), and we hear from Roger, commenting on Denis's poster, that link transistors are physically BIGGER than others.
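The arithmetic above can be checked with a quick sketch (the numbers are the rough estimates from this thread, not measured figures):

```python
# Transistor-budget estimate using the figures discussed in this thread.
T800_LOGIC = 100_000   # T800 minus memory (earlier estimate)
PER_KB     = 50_000    # transistors per KB of on-chip RAM

def transputer_transistors(kb_ram):
    """Estimated transistors for one transputer with kb_ram KB of RAM."""
    return T800_LOGIC + PER_KB * kb_ram

per_node = transputer_transistors(32)   # Tony's 32 KB case
total    = 10_000 * per_node            # 10,000 transputers on one die

print(per_node)   # 1,700,000 - almost all of it memory
print(total)      # 17,000,000,000 - short of Nvidia's 54B
```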

 

My comment was not about the size of the transistors; it was about the density of transistors within the real-estate dedicated to links. I don’t *know* why this is, but it is likely that the density is limited by wiring. This is a perennial problem - the manufacturing process used for the transputer had (from memory) 2 layers of interconnect: 1 metal and 1 polysilicon. Modern processes have a lot more capability - perhaps 10 layers of wiring. In practice, even with this, logic density is limited by interconnect, not by transistors. “Transistors are free” isn’t quite true, but they aren’t the critical resource in modern designs. Again, the constraints on interconnect are such that local communication (same clock domain) is cheap and fast, while non-local is expensive and slow. N.B. Throughput is cheap - “just” go parallel; it’s a matter of economics. Latency is hard - you’re up against the laws of Physics.

 

And some more transistors are needed - I don't know how many - for Tony's extra interconnects. The fact that the geometry will be simply repetitive will help, and distributing resources helps immensely.

 

So we have a bunch of variables here. But by backing off the 32 KB memory (we need analysis of use cases here) we get lots more Transputers - almost six times as many. When I say "use case", I think climate modeling is a good thought exercise. And it's true that Nvidia claims 5 petaflops, which works out to 500 gigaflops per core, which seems high. But these are all very creative questions.
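Both of those ratios can be sanity-checked the same way (again using the thread's rough figures):

```python
# Per-node budgets at the two memory sizes discussed in this thread.
T800_LOGIC = 100_000
PER_KB     = 50_000

budget_32kb = T800_LOGIC + PER_KB * 32   # 1.7M transistors per node
budget_4kb  = T800_LOGIC + PER_KB * 4    # 300k transistors per node
print(budget_32kb / budget_4kb)          # ~5.67 - "almost six times as many"

# Nvidia's claimed throughput spread over 10,000 hypothetical cores.
nvidia_flops = 5e15                      # claimed 5 petaflops
cores        = 10_000
print(nvidia_flops / cores / 1e9)        # 500 gigaflops per core
```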

 

“Use case” really matters. We know that for some domains, specialised machines (GPUs, ML/AI-PUs) do very well. You have to get the balance between compute, store and communication right. If your processor is underpowered you have to use too many of them, causing you problems with storage and communication. Remember, the program has to be stored locally, so requiring double the number of processors because of limited compute capability means twice the store dedicated to program. You’ll also need more communication capability to deal with the number of processors. It’s absolutely the case that the transputer processor is underpowered by today’s standards - I don’t know by how much. So, in your budget for this device, you need to allow many more transistors for the processor, more for RAM - I’m sure 32k is too small - and a lot more for interconnect. The system structure is likely to be “transputers” (processor, RAM and limited comms) plus routers.

 

The nearest machine I know of that might have this sort of architecture is the Graphcore Colossus (https://www.graphcore.ai/products/ipu): 60B transistors, 1472 processor cores, and around 900MB of in-processor memory. Its comms system provides non-blocking communication in any pattern (so arguably well-provisioned as a general-purpose machine - though programs have to be bulk synchronous, which is a problem).

 

Roger

 

--

Roger Shepherd