
Re: Transistor count



Following up on Tony's numbers with a reference to Uwe's numbers and Denis's poster and Roger's comment -

Starting with what we figured before, say 100,000 for a T800 minus memory plus 50,000/KB for 4 KB memory, recalculating for Tony's 32KB gives 1,700,000 per Transputer (almost all dedicated to memory), and 10,000 of them is 17B transistors - short of Nvidia's 54B (are we missing a third dimension here?). However, Uwe suggests much more weight for the links (his version dedicates half its LUTs to links, while the other consensus was less than a third, even when not counting memory), and we hear from Roger, commenting on Denis's poster, that link transistors are physically BIGGER than others. And some more transistors are needed - I don't know how many - for Tony's extra interconnects. The fact that the geometry will be simply repetitive will help, and distributed resources help immensely.

So we have a bunch of variables here. But by backing off the 32KB memory (need analysis of use cases here) we get lots more Transputers, almost six times as many. When I say "use case" I think climate modeling is a good thought exercise. And it's true that Nvidia claims 5 petaflops, which works out to 500 gigaflops per core, which seems high. But these are all very creative questions.
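
As a rough check of the arithmetic above, here is a minimal Python sketch. All constants are the thread's back-of-the-envelope estimates, not measured values, and the names are made up for illustration. The Nvidia core count is rounded down to 10,000.

```python
# Back-of-the-envelope transistor budget using the rough estimates from this thread.
LOGIC_TRANSISTORS = 100_000            # T800 minus memory (thread estimate)
TRANSISTORS_PER_KB = 50_000            # on-chip RAM cost per KB (thread estimate)
NVIDIA_CORES = 10_000                  # "over 10,000" cores, rounded down (assumption)
NVIDIA_FLOPS = 5e15                    # claimed 5 petaflops


def transistors_per_transputer(memory_kb: int) -> int:
    """Logic plus on-chip RAM for one Transputer."""
    return LOGIC_TRANSISTORS + TRANSISTORS_PER_KB * memory_kb


small = transistors_per_transputer(4)         # 4 KB version
large = transistors_per_transputer(32)        # 32 KB version
array_total = 10_000 * large                  # 10,000 of the 32 KB version
count_ratio = large / small                   # how many more 4 KB parts fit
flops_per_core = NVIDIA_FLOPS / NVIDIA_CORES  # claimed flops per Nvidia core

print(f"4 KB Transputer:  {small:,} transistors")
print(f"32 KB Transputer: {large:,} transistors")
print(f"10,000 x 32 KB:   {array_total:,} transistors")
print(f"4 KB parts per 32 KB part: {count_ratio:.2f}")
print(f"Nvidia per core:  {flops_per_core / 1e9:.0f} gigaflops")
```

The ratio comes out a little under six, which is where "almost six times as many" comes from.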

Larry

On Nov 20, 2020, at 11:28 AM, Tony Gore <tony@xxxxxxxxxxxx> wrote:

Hi all
 
The T800 was 100mm2 in 3 micron technology. So the current bleeding edge is around 500 times smaller, so you could get approx. 25,000 T800s on a chip of the same size today. Except you couldn’t in reality, because of the interconnect required. Let’s assume more layers of interconnect, and some more memory (say 32k), and you would be looking at 1,000 – 10,000. The T800 clocked at 20MHz for 10 MIPS and 10 MFlops. So take a clock rate of 2GHz, and the performance goes up 100X. So very roughly a T800 array taking the same size silicon would have 10,000 x 100 = 1 million times the raw performance, coming in at a meaningless 10 teraFLOPS, and two and a half B042 boards would reach a petaflop. Not that it would do much useful other than ray tracing or the Mandelbrot set.
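
The performance scaling above can be sketched in a few lines of Python. This is only a check of the arithmetic, taking the upper figure of 10,000 processors per chip. It assumes a B042 board carries 42 devices, which is not stated in this message. The names are illustrative.

```python
# Rough performance scaling for a hypothetical modern T800 array.
T800_FLOPS = 10_000_000            # 10 MFlops at 20 MHz
T800_CLOCK_HZ = 20_000_000         # 20 MHz
MODERN_CLOCK_HZ = 2_000_000_000    # 2 GHz
PROCESSORS_PER_CHIP = 10_000       # upper end of the 1,000 - 10,000 range
B042_SLOTS = 42                    # devices per B042 board (assumption)
PETAFLOP = 10**15

clock_speedup = MODERN_CLOCK_HZ // T800_CLOCK_HZ          # 100x
raw_speedup = PROCESSORS_PER_CHIP * clock_speedup         # 1,000,000x
chip_flops = T800_FLOPS * raw_speedup                     # 10 teraflops
chips_per_petaflop = PETAFLOP // chip_flops               # 100 chips
boards_per_petaflop = chips_per_petaflop / B042_SLOTS     # about 2.4 boards

print(f"clock speedup:      {clock_speedup}x")
print(f"raw speedup:        {raw_speedup:,}x")
print(f"per chip:           {chip_flops / 1e12:.0f} teraflops")
print(f"chips per petaflop: {chips_per_petaflop}")
print(f"B042 boards:        {boards_per_petaflop:.2f}")
```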
 
People are playing around with huge amounts of simple processing embedded in the memory for some ML and AI applications and getting some great performance. There are also some new devices/building blocks around that bear a passing resemblance to the Inmos A100 as well, or at least on a cursory look there were a few familiar looking concepts in them.
 
Mind blowing really to see how performance has increased in these few decades.
 
 
Tony Gore
 
Aspen Enterprises Limited email  tony@xxxxxxxxxxxx
tel +44-1278-769008  GSM +44-7768-598570 URL:
 
Registered in England and Wales no. 3055963 Reg.Office Aspen House, Burton Row, Brent Knoll, Somerset TA9 4BW.  UK
 
 
 
From: Larry Dickson <tjoccam@xxxxxxxxxxx> 
Sent: 20 November 2020 00:57
To: Øyvind Teig <oyvind.teig@xxxxxxxxxxx>
Cc: Roger Shepherd <rog@xxxxxxxx>; Tony Gore <tony@xxxxxxxxxxxx>; David May <David.May@xxxxxxxxxxxxx>; occam-com <occam-com@xxxxxxxxxx>; Michael Bruestle <michael_bruestle@xxxxxxxxx>; Transputer TRAM <claus.meder@xxxxxxxxxxxxxx>
Subject: Re: Transistor count
 
Wow, this is a fantastic response! I had no idea there was still so much interest lurking
around. 6000-8000 seems to be the consensus and I must point out that I was
just guessing when I said 6000 - not going from real knowledge like so many
of you.
 
In any case, all these communications put together really put us in the picture.
Now we turn to the fact that Nvidia has unveiled (last May) the A100 chip with 54
billion transistors and over 10,000 GPU (AI) cores. But with our back-of-the-envelope
dreaming we could imagine 180,000 of Tony Gore's T800s in a chip with the same
transistor count . . . clocked up to modern standards . . . climate models, anyone?
 
Larry
 
On Nov 19, 2020, at 2:47 PM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:


I mentioned this mail to Claus Meber (claus.meder@xxxxxxxxxxxxxx - added to this thread). I knew he had some info that might interest you. Here is his response to me:
 
- - - CLAUS MEBER START - - - 
 
Hi Øyvind,
 
Sounds interesting. I did not even know that there is still such a large community out there. If you like you can forward my mail address and the following short summary about my FPGA Transputer hobby to the interested people.
 
Short summary of my project's current status:
 
I'm using a Digilent Arty A7-100 Board. The Xilinx Vivado Tool is able to map into the Artix 7-100 chip:
  • 14 fully featured T425C with a limited number of links per transputer
  • 14 16-kbyte 2-way set-associative caches, one for each T425C, as a port to external memory, which is 16MB for each node of the single 256MB DDR3 chip on the board.
  • one Xilinx MIG core for DDR3 access, connected to all the caches, access prioritized by a round-robin arbiter
  • a graphics core for VGA output 1600x1200 32 or 8 bit per pixel
  • an Ethernet MII interface 10/100 Mbit which can be accessed by the root transputer only. I ported LWIP to the good old ANSI-C compiler, thus having the Transputer running a Web-Server
  • the root Transputer got one link to the host which is fully compatible with the old 20Mbit Links. I connected the very nice LINKUSB which was designed and built by Mike. The rest of the Links are running at CPU clock speed with the original serial protocol consuming 11bits per byte.
  • Performance-Counters for each Transputer to get more insight into where the bottlenecks of the design are
A little bit of background on how I achieved it: I started my project at the end of 2018, inspired by my discovery of the Microcode ROM dump from a T425C on Gavin's web-site. I started decoding the meaning of the bits, first on a spreadsheet and later with the help of an emulator written in C. Thus I was able to bring in my ideas about how the brilliant Inmos designers might have solved their problems, following the basic principle of keeping everything as simple as possible. Now I can state they did a really good job. It is really not a very complex design. Many things are solved in a very clever way, avoiding spending too many transistors on the function needed.
By the end of 2019 I had enough insight into how to design the T425C in FPGA technology. Unlike the original design, I decided on a single-clock, fully synchronous design. Implementation took half a year. Testing and bug fixing took until early summer this year. Since then I have done some ANSI-C and OCCAM programming on my small "super computer". Like everyone, I wrote a distributed Mandelbrot calculation with my own router processes. Currently I'm running the old flight simulator and, because the source is available, I'm enhancing it.
A big thank you to Mike who supported me with all his great knowledge about Transputers and was a very good partner in discovering some secrets of the design.
I'm sure over the Christmas period I can write some more documentation which is always a burden for me because I'm fully satisfied if I understood and solved a problem.
To feed the discussion about resources, here is what my latest Vivado run states (synthesis and implementation set to default strategy).
[Image attachment: Vivado resource utilization report]
 
Interested in clock speeds? The CPU core achieves around 80MHz for the xc7a100tcsg324-1 device. With the FPGA flooded by T425Cs it drops to 70+MHz. Luckily my Arty board can be over-clocked. The design is currently running at a 120MHz clock speed (WNS around -5ns). I tried to ask Xilinx which part I really have but unfortunately they did not grant me access to their 2D Marking Application Lounge :-(
Here is the rspy output:

rspy -l
   # Part  rt Link0 Link1 Link2 Link3
   0 T425C120 1736K   ...   10M   ...
   1 T425C120   10M   10M   10M   10M
   2 T425C120   10M   10M   ...   ...
   3 T425C120   10M   10M   ...   ...
   4 T425C120   10M   10M   ...   ...
   5 T425C120   10M   10M   ...   ...
   6 T425C120   10M   10M   ...   ...
   7 T425C120   10M   10M   ...   ...
   8 T425C120   10M   10M   ...   ...
   9 T425C120   10M   10M   10M   ...
  10 T425C120   10M   10M   10M   ...
  11 T425C120   10M   10M   10M   ...
  12 T425C120   10M   ...   10M   ...
  13 T425C120   10M   10M   10M   10M

The configuration is for running the flight simulator, hence the four-link connection on the last node, which is the graphics node.
 
Please feel free to contact me.
 
   Claus
 
- - - CLAUS MEBER END - - - 
 
Øyvind
 


19. nov. 2020 kl. 20:28 skrev Øyvind Teig <oyvind.teig@xxxxxxxxxxx>:
 
Guys
 
 
I guess the microcode is of little help? http://transputer.net/iset/iset.asp
 
 
But good memories plus back-of-the-envelope calculations should land within a factor of... better than 10?
 
(I added Michael Brüstle at transputer.net to this mail list)
 
Øyvind 


19. nov. 2020 kl. 19:54 skrev Roger Shepherd <rog@xxxxxxxx>:
 
If I recall, the T4 was 25% RAM, 25% processor, 25% link and 25% other - by area (things like pads take up a lot of space but not many transistors). The RAM is much more transistor dense than the other blocks. The link block (4 bidirectional links and the event channel) is significantly less dense - the ‘register’ part is CPU-like but the actual shift registers and synchronisers are very non-dense. So, perhaps 25% of the density of RAM overall. Now, doing the measurement on my photograph, it looks like 4 links occupy 2/3rds the area of the RAM, which gives
 
200k * 25% * 2/3 = 33k transistors for 4 links = 8k per link (which is in line with your estimate below). I suspect your estimate is nearer to the truth than mine.
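
The estimate above, written out as a small Python sketch. The inputs are the recollections and photograph measurements given in this message, so the result is only as good as those. The names are illustrative.

```python
# Link transistor estimate from relative density and area.
RAM_TRANSISTORS = 200_000      # 4 KB of on-chip RAM (thread estimate)
LINK_DENSITY_VS_RAM = 0.25     # link block is about 25% as dense as RAM
LINK_AREA_VS_RAM = 2 / 3       # 4 links occupy about 2/3 of the RAM area
LINK_COUNT = 4

link_block_transistors = RAM_TRANSISTORS * LINK_DENSITY_VS_RAM * LINK_AREA_VS_RAM
transistors_per_link = link_block_transistors / LINK_COUNT

print(f"4 links:  about {link_block_transistors:,.0f} transistors")
print(f"per link: about {transistors_per_link:,.0f} transistors")
```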
 
But in assessing anything you need to consider that the transistor count is affected by word size (two words of buffer, one word of address) and control of the interconnect to route bytes into the buffer etc. 
 
Roger


On 19 Nov 2020, at 17:55, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:
 
Hi Tony, David and all,
 
Does anyone remember how many transistors are in a link? We are
gathering information on transistor efficiency; now Tony's numbers
indicate floating point costs about 50,000, and David on memory
indicates 4KB costs about 200,000. 25,000 for CPU and 25,000 for
links would indicate 6000 per link, but that is just a guess and I
could be way off.
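
A minimal Python sketch of how that guess falls out, assuming the 300,000 total quoted for the T800 elsewhere in this thread. The even split of the remainder between CPU and links is the guess itself, not a known figure.

```python
# Splitting a T800 transistor budget into its rough parts.
T800_TOTAL = 300_000       # total quoted elsewhere in this thread
FLOATING_POINT = 50_000    # estimate
MEMORY_4KB = 200_000       # estimate
LINK_COUNT = 4

remainder = T800_TOTAL - FLOATING_POINT - MEMORY_4KB   # left for CPU and links
cpu = remainder // 2                                   # assumed even split
links = remainder - cpu
per_link = links / LINK_COUNT

print(f"CPU: {cpu:,}  links: {links:,}  per link: {per_link:,.0f}")
```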
 
As you may be guessing, I am imagining an eight-link Transputer!
Long ago, in my PDPTA'96 Roadmap paper, I calculated "burden
bandwidth" for a one-direction link communication using Forrest
Crowell and Neal Elzenga's published measurements, and got
37MB/s, same whether links were running one-way or both-ways,
and per-unidirectional-communication timing bandwidth of 1.2 MB/s
when running both ways. This means by extrapolation that eight
links running full speed both ways would be supportable (reducing
CPU speed by 52% due to DMA burden).
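
One way to read the extrapolation above, as a Python sketch. It assumes the DMA burden is simply the total link traffic divided by the burden bandwidth, which is an interpretation of the figures rather than something taken from the paper.

```python
# Extrapolating DMA burden to an eight-link Transputer.
BURDEN_BANDWIDTH_MB_S = 37.0    # "burden bandwidth" from the measurements
PER_DIRECTION_MB_S = 1.2        # per unidirectional communication, both ways running
LINK_COUNT = 8
DIRECTIONS_PER_LINK = 2

total_traffic = LINK_COUNT * DIRECTIONS_PER_LINK * PER_DIRECTION_MB_S  # MB/s
cpu_burden = total_traffic / BURDEN_BANDWIDTH_MB_S   # fraction of CPU lost to DMA

print(f"total link traffic: {total_traffic:.1f} MB/s")
print(f"CPU burden:         {cpu_burden:.0%}")
```

Under that reading the burden stays below 100%, which is what makes eight links supportable.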
 
Everything can be mapped into modern cores and communications
(e.g. Manchester code lanes); the principle stays the same.
 
Larry
 
On Mar 18, 2019, at 9:59 PM, Tony Gore <tony@xxxxxxxxxxxx> wrote:


Hi Larry

As I recall, T414 was about 250,000 and the T800 was 300,000.

Tony Gore
+44 7768 598570
 

From: occam-com-request@xxxxxxxxxx <occam-com-request@xxxxxxxxxx> on behalf of Larry Dickson <tjoccam@xxxxxxxxxxx>
Sent: Tuesday, March 19, 2019 1:06:42 AM
To: Occam Family
Subject: Transistor count
 
All,

How many transistors does a Transputer have (e.g. of the T2 or T4 family)? I have heard a wide range of numbers, from 27,000 to 200,000, but am having trouble finding an authoritative reference.

Larry