
Re: Transistor count



On 24/11/2020 00:19, Larry Dickson wrote:
The other side of the coin would be if the link engine(s) were essentially all wormhole routers, as in the C104 router chips, complete with packet addressing. The link coprocessor would then essentially become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while it would take up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result would be the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
An excellent point. But we need to remain use-case-sensitive. Some physics is so simple that the main effort and its communications would be standard enough that little such direction-mapping code would be needed - and then the overhead of the wormhole routing could be a net negative.
For something simple and regular you're going to be much better off with a GPU - their overwhelming strength is highly parallel tasks of that sort. What I outlined was intended as a more generalised solution where the CPUs are cooperating on a goal but definitely doing different things.
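
Coming back to the routing point, though: to make the 'receive and pass on' saving concrete, here is a minimal sketch in C of the interval-labelled routing decision a C104-style crossbar makes in hardware - the decision that software on directly connected links otherwise has to make for every through-routed packet. The names are purely illustrative, not taken from any real device's interface.

/* Sketch only: interval-labelled routing as in a C104-style crossbar.
 * Each output link owns a half-open range of destination addresses;
 * the packet's header address picks the link whose range contains it.
 * All names here are illustrative assumptions. */
#include <stdint.h>

#define NUM_LINKS 32

struct interval {
    uint16_t low;    /* first destination address routed via this link */
    uint16_t high;   /* one past the last such address */
};

static struct interval route_table[NUM_LINKS];

/* Return the output link for a packet's header address,
 * or -1 if no interval covers it (a configuration error). */
static int select_output_link(uint16_t dest)
{
    for (int link = 0; link < NUM_LINKS; link++) {
        if (dest >= route_table[link].low && dest < route_table[link].high)
            return link;
    }
    return -1;
}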

Different kinds of links/channels can branch out in even more directions that have never been explored, like a hybrid between soft and hard channels (NUMA). Anything that acts like a channel may be our friend.

I'm not following you in your comparison here ... link types and NUMA?? Perhaps you could elaborate.


The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing the communication, not about which physical layer was involved, and to provide a way for the link engine to raise an exception (e.g. a software interrupt) to the processor if it cannot complete the communication 'instantly', potentially requiring a thread to be suspended or released.
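
As a rough illustration of how that might look from the software side - a sketch only, with assumed names for the link-engine and kernel entry points, not a definition of any real interface:

/* Sketch, assuming a hypothetical link-engine interface: the caller
 * names only a channel; whether it maps to a soft channel or to a
 * physical link is resolved below this call. link_try_send() starts
 * the transfer and returns true if it completed immediately; if not,
 * the engine's 'cannot complete instantly' exception leads to the
 * calling thread being descheduled until completion. */
#include <stddef.h>
#include <stdbool.h>

struct channel;                         /* opaque: soft or hard channel */

/* Assumed hooks provided by the link engine and the threading kernel: */
bool link_try_send(struct channel *c, const void *buf, size_t len);
void thread_wait_on(struct channel *c); /* suspend until completion IRQ */

void chan_out(struct channel *c, const void *buf, size_t len)
{
    if (link_try_send(c, buf, len))     /* fast path: done immediately  */
        return;
    thread_wait_on(c);                  /* slow path: suspend until the */
}                                       /* completion interrupt resumes */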

I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did; I do feel it is valuable for the hardware to provide appropriate hooks for a light threaded kernel to do the job efficiently.

I am not following you here. Where did this weigh heavily? It never seemed much of a burden to me - much less than the burden of supporting a kernel. Basically it's interrupts (done way more cleanly than on other processors), a few words of process support, and rare, simply implemented time-slicing. You cannot escape interrupts, and any kernel I ever heard of is far more onerous than this (and has horrible effects on the code design, by separating kernel from user code). What burdened the Transputer was the standard correctness checks, but if you want correct code . . . And even those could be streamlined.

The issue with the transputer design was that it fixed the implementation, so people who, for reasons they thought valid, needed more levels of priority, had to jump through lots of hoops. Kernel design in software is pretty simple and well researched now and still there are many flavours of it. That is the reason I would personally omit built-in scheduling, but provide whatever hooks were appropriate to enable it in software.
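
For instance - and this is only a sketch with assumed names, not a proposal for a concrete interface - the hook might be little more than a completion interrupt that hands the unblocked thread back to software, so the number of priority levels and the queueing policy are entirely a kernel decision:

/* Sketch of a software run queue fed by a hardware hook. The hardware
 * only reports which thread a completed communication unblocks; how
 * many priority levels exist, and how each queue is ordered, is
 * decided here in software. All names are assumptions. */
#include <stdint.h>

#define NUM_PRIORITIES 8                /* a software choice, not fixed at 2 */

struct thread {
    struct thread *next;
    uint8_t priority;                   /* 0 = highest */
};

static struct thread *run_queue[NUM_PRIORITIES];

/* Hypothetical hook, called from the link engine's completion interrupt
 * with the thread that was blocked on the finished communication. */
void on_link_completion(struct thread *waiter)
{
    /* Simplest possible policy: push onto the head of its queue.
     * Any other policy (FIFO, deadlines, ...) is equally possible,
     * because none of this is baked into the silicon. */
    waiter->next = run_queue[waiter->priority];
    run_queue[waiter->priority] = waiter;
}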

The other thing is that the transputer gained heavily from control over the instruction set, both in being stack-based and in integrating thread-switch points into branches. While those could be reimplemented, doing so would change the ISA of the target CPU, rendering massive amounts of software unusable. I would prefer to evolve a better solution.

As for correctness checks - I presume you mean the bounds checks on arrays, etc. - that was a feature of occam, not the transputer; the transputer made doing it relatively easy but did not mandate it. And to be honest, it was not a bad decision in occam. So many of the virus hacks today would disappear if the software industry just went with mandated array bounds checks.
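
For what it's worth, a mandated bounds check amounts to very little code. A sketch in C, with made-up names, of what an occam-style checked access boils down to:

/* Sketch only: the check an occam compiler applies to an indexed
 * access, written out by hand in C. Names are made up. */
#include <stdlib.h>

static int checked_read(const int *a, size_t len, size_t i)
{
    if (i >= len)
        abort();   /* occam stops the errant process rather than
                      letting it corrupt or leak adjacent memory */
    return a[i];
}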


Ruth's proposals seem to be focused on a different set of use cases than mine, so there is room in the universe for both of us ;-) GPUs show there is room on my side, and I have a notion that study of use cases will show there is lots of room out in embedded-style hundred-thousand-core-land.

I was suggesting lower numbers of CPUs because I was presuming FPGA implementation on devices that cost less than a family car. Of course more would be nice... but it is also true that I think you were aiming at a more minimal implementation - something very very close to a T425 or T800 on modern silicon.

While that does appeal, I think compared to modern processors it would be outclassed (even given greater numbers) rather quickly, because of the architectural and silicon improvements made since then, and because of Amdahl's law. That is, compare 100,000 CPUs capable of 10 MIPS each (on aggregate over the whole program, not just their own flatline speed) with 2,000 CPUs capable of 1,000 MIPS each (again, on aggregate). It is the ever-present tension between faster and wider. Or, to put it another way, 'The Mythical Man-Month'.
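
To put an (entirely assumed) number on the Amdahl's-law point: with any non-zero serial fraction s, speedup is capped at 1/s no matter how many cores you add, so beyond a point extra cores contribute almost nothing. A back-of-envelope sketch, where the serial fraction is an assumption chosen only for illustration:

/* Back-of-envelope Amdahl's-law illustration of why the per-core
 * contribution collapses as the core count grows. The serial fraction
 * s is an assumption for illustration only; nothing here is measured.
 * Speedup over one core = 1 / (s + (1 - s) / n). */
#include <stdio.h>

static double amdahl_speedup(double n_cores, double s)
{
    return 1.0 / (s + (1.0 - s) / n_cores);
}

int main(void)
{
    double s = 0.001;   /* assume 0.1% of the work is inherently serial */

    /* 50x more cores buys well under 1.5x more speedup here: */
    printf("  2,000 cores: speedup %.0f\n", amdahl_speedup(2000.0, s));
    printf("100,000 cores: speedup %.0f\n", amdahl_speedup(100000.0, s));
    printf("upper bound for any core count: %.0f\n", 1.0 / s);
    return 0;
}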

I have myself wondered about massive arrays of small CPUs - I tend to think of 6502s - but experience tells me it quickly becomes very hard to use such a thing effectively, especially given the relatively small memory and I/O bandwidths available. The only places such arrays work well, that I know of, are the embarrassingly parallel ones, which is why this is exactly what happens in most GPUs. The basic GPU core is often a very small core with extremely limited capability, but replicated thousands of times. In recent generations, larger cores (that I mentioned earlier) are also used, which are more capable but fewer in number. In the graphics workflow, the basic cores are used for pixel-colour and vertex calculations, while the larger cores are more texture-based - that is, bigger-picture stuff.

Another area of current interest that is embarrassingly parallel is, of course, neural networks/AI, which typically use many thousands of nodes representing points on a decision tree or network. While some research groups are simulating such networks on arrays of small CPUs while trying to find good algorithms and network designs, there is a lot of effort being thrown into hard-coding simple algorithms into custom circuits that can be even more efficient and packed much more densely. I am no expert on such things, though.

Best wishes,

Ruth


--
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/