
Re: Transistor count



On 24/11/2020 00:19, Larry Dickson wrote:
The other side of the coin would be if the link engine(s) were essentially all wormhole routers, as in the C104 router chips, complete with packet addressing. The link coprocessor would then essentially become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while it would take up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result would be the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
An excellent point. But we need to remain use-case-sensitive. Some physics is so simple that the main effort and its communications would be standard enough that little such direction-mapping code would be needed - and then the overhead of the wormhole routing could be a net negative.
For something simple and regular you're going to be much better off with a GPU - their overwhelming strength is highly parallel tasks of that sort. What I outlined was intended as a more generalised solution where the CPUs are cooperating on a goal but definitely doing different things.
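
Coming back to the routing point, though: to make the 'receive and pass on' saving concrete, here is a minimal sketch in C of the interval-labelled routing decision a C104-style crossbar makes in hardware - the decision that software on directly connected links otherwise has to make for every through-routed packet. The names are purely illustrative, not taken from any real device's interface.

/* Sketch only: interval-labelled routing as in a C104-style crossbar.
 * Each output link owns a half-open range of destination addresses;
 * the packet's header address picks the link whose range contains it.
 * All names here are illustrative assumptions. */
#include <stdint.h>

#define NUM_LINKS 32

struct interval {
    uint16_t low;    /* first destination address routed via this link */
    uint16_t high;   /* one past the last such address */
};

static struct interval route_table[NUM_LINKS];

/* Return the output link for a packet's header address,
 * or -1 if no interval covers it (a configuration error). */
static int select_output_link(uint16_t dest)
{
    for (int link = 0; link < NUM_LINKS; link++) {
        if (dest >= route_table[link].low && dest < route_table[link].high)
            return link;
    }
    return -1;
}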

Different kinds of links/channels can branch out in even more directions that have never been explored, like a hybrid between soft and hard channels (NUMA). Anything that acts like a channel may be our friend.

I'm not following you in your comparison here ... link types and NUMA?? Perhaps you could elaborate.


The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing the communication, not about which physical layer was involved, and to provide a way for the link engine to raise an exception (e.g. a software interrupt) to the processor if it cannot complete the communication 'instantly', potentially requiring a thread to be suspended or released.
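
As a rough illustration of how that might look from the software side - a sketch only, with assumed names for the link-engine and kernel entry points, not a definition of any real interface:

/* Sketch, assuming a hypothetical link-engine interface: the caller
 * names only a channel; whether it maps to a soft channel or to a
 * physical link is resolved below this call. link_try_send() starts
 * the transfer and returns true if it completed immediately; if not,
 * the engine's 'cannot complete instantly' exception leads to the
 * calling thread being descheduled until completion. */
#include <stddef.h>
#include <stdbool.h>

struct channel;                         /* opaque: soft or hard channel */

/* Assumed hooks provided by the link engine and the threading kernel: */
bool link_try_send(struct channel *c, const void *buf, size_t len);
void thread_wait_on(struct channel *c); /* suspend until completion IRQ */

void chan_out(struct channel *c, const void *buf, size_t len)
{
    if (link_try_send(c, buf, len))     /* fast path: done immediately  */
        return;
    thread_wait_on(c);                  /* slow path: suspend until the */
}                                       /* completion interrupt resumes */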

I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did; I do feel it is valuable for the hardware to provide appropriate hooks for a light threaded kernel to do the job efficiently.

I am not following you here. Where did this weigh heavily? It never seemed much of a burden to me - much less than the burden of supporting a kernel. Basically it's interrupts (done way more cleanly than on other processors), a few words of process support, and rare, simply implemented time-slicing. You cannot escape interrupts, and any kernel I ever heard of is far more onerous than this (and has horrible effects on the code design, by separating kernel from user code). What burdened the Transputer was the standard correctness checks, but if you want correct code . . . And even those could be streamlined.

The issue with the transputer design was that it fixed the implementation, so people who, for reasons they thought valid, needed more levels of priority, had to jump through lots of hoops. Kernel design in software is pretty simple and well researched now and still there are many flavours of it. That is the reason I would personally omit built-in scheduling, but provide whatever hooks were appropriate to enable it in software.
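
For instance - and this is only a sketch with assumed names, not a proposal for a concrete interface - the hook might be little more than a completion interrupt that hands the unblocked thread back to software, so the number of priority levels and the queueing policy are entirely a kernel decision:

/* Sketch of a software run queue fed by a hardware hook. The hardware
 * only reports which thread a completed communication unblocks; how
 * many priority levels exist, and how each queue is ordered, is
 * decided here in software. All names are assumptions. */
#include <stdint.h>

#define NUM_PRIORITIES 8                /* a software choice, not fixed at 2 */

struct thread {
    struct thread *next;
    uint8_t priority;                   /* 0 = highest */
};

static struct thread *run_queue[NUM_PRIORITIES];

/* Hypothetical hook, called from the link engine's completion interrupt
 * with the thread that was blocked on the finished communication. */
void on_link_completion(struct thread *waiter)
{
    /* Simplest possible policy: push onto the head of its queue.
     * Any other policy (FIFO, deadlines, ...) is equally possible,
     * because none of this is baked into the silicon. */
    waiter->next = run_queue[waiter->priority];
    run_queue[waiter->priority] = waiter;
}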

The other thing is that the transputer gained heavily from control over the instruction set, both in being stack-based and in integrating thread-switch points into branches. While those could be reimplemented, doing so would change the ISA of the target CPU, rendering massive amounts of software unusable. I would prefer to evolve a better solution.

As for correctness checks - I presume you mean the bounds checks on arrays, etc. - that was a feature of occam, not the transputer; the transputer made doing it relatively easy but did not mandate it. And to be honest, it was not a bad decision in occam. So many of the virus hacks today would disappear if the software industry just went with mandated array bounds checks.
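
For what it's worth, a mandated bounds check amounts to very little code. A sketch in C, with made-up names, of what an occam-style checked access boils down to:

/* Sketch only: the check an occam compiler applies to an indexed
 * access, written out by hand in C. Names are made up. */
#include <stdlib.h>

static int checked_read(const int *a, size_t len, size_t i)
{
    if (i >= len)
        abort();   /* occam stops the errant process rather than
                      letting it corrupt or leak adjacent memory */
    return a[i];
}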


Ruth's proposals seem to be focused on a different set of use cases than mine, so there is room in the universe for both of us ;-) GPUs show there is room on my side, and I have a notion that study of use cases will show there is lots of room out in embedded-style hundred-thousand-core-land.

I was suggesting lower numbers of CPUs because I was presuming FPGA implementation on devices that cost less than a family car. Of course more would be nice... but it is also true that I think you were aiming at a more minimal implementation - something very very close to a T425 or T800 on modern silicon.

While that does appeal, I think compared to modern processors it would be outclassed (even given greater numbers) rather quickly, because of the architectural and silicon improvements made since then, and because of Amdahl's law. That is, compare 100,000 CPUs capable of 10 MIPS each (on aggregate over the whole program, not just their own flatline speed) with 2,000 CPUs capable of 1,000 MIPS each (again, on aggregate). It is the ever-present tension between faster and wider. Or, to put it another way, 'The Mythical Man-Month'.
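
To put an (entirely assumed) number on the Amdahl's-law point: with any non-zero serial fraction s, speedup is capped at 1/s no matter how many cores you add, so beyond a point extra cores contribute almost nothing. A back-of-envelope sketch, where the serial fraction is an assumption chosen only for illustration:

/* Back-of-envelope Amdahl's-law illustration of why the per-core
 * contribution collapses as the core count grows. The serial fraction
 * s is an assumption for illustration only; nothing here is measured.
 * Speedup over one core = 1 / (s + (1 - s) / n). */
#include <stdio.h>

static double amdahl_speedup(double n_cores, double s)
{
    return 1.0 / (s + (1.0 - s) / n_cores);
}

int main(void)
{
    double s = 0.001;   /* assume 0.1% of the work is inherently serial */

    /* 50x more cores buys well under 1.5x more speedup here: */
    printf("  2,000 cores: speedup %.0f\n", amdahl_speedup(2000.0, s));
    printf("100,000 cores: speedup %.0f\n", amdahl_speedup(100000.0, s));
    printf("upper bound for any core count: %.0f\n", 1.0 / s);
    return 0;
}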

I have myself wondered about massive arrays of small CPUs - I tend to think of 6502s - but experience tells me it quickly becomes very hard to use such a thing effectively, especially given the relatively small memory and I/O bandwidths available. The only places such arrays work well, that I know of, are the embarrassingly parallel ones, which is why this is exactly what happens in most GPUs. The basic GPU core is often a very small core with extremely limited capability, but replicated thousands of times. In recent generations, larger cores (that I mentioned earlier) are also used, which are more capable but fewer in number. In the graphics workflow, the basic cores are used for pixel-colour and vertex calculations, while the larger cores are more texture-based - that is, bigger-picture stuff.

Another area of current interest that is embarrassingly parallel is, of course, neural networks/AI, which typically use many thousands of nodes representing points on a decision tree or network. While some research groups are simulating such networks on arrays of small CPUs while trying to find good algorithms and network designs, there is a lot of effort being thrown into hard-coding simple algorithms into custom circuits that can be even more efficient and packed much more densely. I am no expert on such things, though.

Best wishes,

Ruth


--
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/