
Re: Transistor count




On 24 Nov 2020, at 23:29, Roger Shepherd <rog@xxxxxxxx> wrote:

This is an interesting point. It is certainly true that it was difficult to use the transputer with a different process model. Xmos explored this in the Xcore which supported hardware processes but had a more primitive mechanism making it possible to build a variety of models on top.  I think there is scope for exploring this area further. The ability to switch contexts very cheaply changes a lot of things. Remember, the transputer could schedule a thread faster than it could multiply.

Yes, Xcore supports «hardware processes» (8 logical threads for «normal» tasks on each core, like on tile[0].core[5]). Plus hardware timers and hardware chanends. However, these are shared in rather mysterious ways, controlled to some extent by whether tasks are marked «combinable» or «distributable». An extra task does not necessarily cost an extra HW logical thread, an extra timer does not necessarily cost a HW timer, and an extra channel or interface does not necessarily cost an extra chanend. The tools analyse state and disjointness, aliasing and sharing, and what have you. At the end of the day it’s quite pleasant to work with.
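A rough analogy in plain C of what a combinable task amounts to (every name below is mine, nothing XMOS-specific): several event-driven tasks get folded into one event loop, so they share a single hardware logical thread instead of taking one each.

/* Hypothetical analogy of "combinable" tasks: two event-driven "tasks"
 * share one loop (i.e. one hardware logical thread). The events here
 * are simulated rather than coming from real chanends or timers. */
#include <stdio.h>

typedef struct { int on; } blinker_state;
typedef struct { int total; } counter_state;

/* Each "task" is just an event handler plus its private state. */
static void blinker_event(blinker_state *s) { s->on ^= 1; }
static void counter_event(counter_state *s, int n) { s->total += n; }

int main(void) {
    blinker_state b = { 0 };
    counter_state c = { 0 };

    /* One loop plays the role of the combined select: whichever "event"
     * arrives is dispatched to the right task's handler. */
    for (int tick = 0; tick < 10; tick++) {
        if (tick % 2 == 0) blinker_event(&b);       /* "timer" event   */
        else               counter_event(&c, tick); /* "channel" event */
    }

    printf("blinker=%d counter=%d\n", b.on, c.total);
    return 0;
}

That is only the shape of the idea, of course - the real tool chain works it out from the task and interface declarations rather than from hand-merged loops.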

but had a more primitive mechanism making it possible to build a variety of models on top.

But... what do you mean by this? Isn’t it more complex at the bottom, while for me as a user it’s simpler? Or have I sat here for too long, going round in my own thoughts? I need an update.

Øyvind


On 24 Nov 2020, at 23:29, Roger Shepherd <rog@xxxxxxxx> wrote:



On 24 Nov 2020, at 03:47, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:

On 24/11/2020 00:19, Larry Dickson wrote:

The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing communication, not about which physical layer was involved, and also to provide a way for the link engine to raise an exception (e.g. a sw interrupt) to the processor if it cannot complete the communication 'instantly', thus potentially requiring a thread to be suspended or released.
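As a sketch of the kind of hook that could be, in plain C and with every name invented for illustration (this is not any real device's interface): a send on a virtual link either completes at once or reports that the thread should be parked until the engine's completion interrupt fires.

/* Hypothetical link-engine interface for virtualised links; all names
 * are made up for this sketch. A send either completes "instantly" or
 * reports that the calling thread should be descheduled until the
 * engine signals completion. */
#include <stddef.h>
#include <stdio.h>

typedef enum { LINK_DONE, LINK_WOULD_BLOCK } link_status;

typedef struct {
    int   channel;                   /* virtual channel, not a physical link */
    void (*on_complete)(void *ctx);  /* invoked from the sw interrupt        */
    void *ctx;
} link_request;

/* Stub standing in for the imagined hardware/driver entry point:
 * pretend short messages go out instantly, longer ones must wait. */
static link_status link_send(link_request *req, const void *buf, size_t len)
{
    (void)req; (void)buf;
    return (len <= 8) ? LINK_DONE : LINK_WOULD_BLOCK;
}

static void resume_thread(void *ctx) { printf("thread %s resumed\n", (char *)ctx); }

int main(void)
{
    link_request req = { 42, resume_thread, "A" };
    const char msg[] = "hello, far end";

    if (link_send(&req, msg, sizeof msg) == LINK_WOULD_BLOCK) {
        /* A kernel would park the thread here; req.on_complete() would
         * later make it runnable when the link engine interrupts. We
         * call it directly just to keep the sketch self-contained. */
        req.on_complete(req.ctx);
    }
    return 0;
}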

I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did, but I do feel it is valuable for the hardware to provide appropriate hooks for a lightweight threading kernel to do the job efficiently.

I am not following you here. Where did this weigh heavily? It never seemed much of a burden to me - much less than the burden of supporting a kernel. Basically it's interrupts (done far more cleanly than on other processors), a few words of process support, and rare, simply implemented time-slicing. You cannot escape interrupts, and any kernel I ever heard of is far more onerous than this (and has horrible effects on the code design, by separating kernel from user code). What burdened the Transputer was the standard correctness checks, but if you want correct code . . . And even those could be streamlined.

The issue with the transputer design was that it fixed the implementation, so people who, for reasons they thought valid, needed more levels of priority, had to jump through lots of hoops. Kernel design in software is pretty simple and well researched now and still there are many flavours of it. That is the reason I would personally omit built-in scheduling, but provide whatever hooks were appropriate to enable it in software.

This is an interesting point. It is certainly true that it was difficult to use the transputer with a different process model. Xmos explored this in the Xcore which supported hardware processes but had a more primitive mechanism making it possible to build a variety of models on top.  I think there is scope for exploring this area further. The ability to switch contexts very cheaply changes a lot of things. Remember, the transputer could schedule a thread faster than it could multiply.

The other thing is that the transputer gained heavily from control over the instruction set, both in being stack based and in integrating thread switch points into branches. While those could be reimplemented, it would change the ISA of the target CPU, rendering massive amounts of software unusable. I would prefer to evolve a better solution.

There’s a lot going on in the transputer’s ISA that maybe isn’t obvious. Ignoring, for the moment, interrupts/high-priority processes: context switches only occurred when there was very little state - the evaluation stack was empty - and the context was defined by W and I. This is one reason why scheduling was fast and cheap. The need to be able to reschedule on jumps (backward jumps?) was there to cause all active processes to make progress. The inclusion of scheduling operations in the ISA meant that interrupts could occur between any two instructions (let’s not get into an argument about whether prefixes were really instructions) - the scheduling instructions were essentially atomic. Now, in fact, inputs and outputs had to be interruptible because they performed a block copy which could be arbitrarily long. However, as only two processes were involved (channels were 1-1), and one was necessarily not active, the active process, performing the copy, could be interrupted without causing scheduling problems.

The state of an interrupted process was much larger - you needed the full stack and some other state. 
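A minimal sketch in plain C of that point (field names beyond W and I are mine, and this only mimics the shape, not the real workspace layout): a voluntarily descheduled process is just a workspace pointer and an instruction pointer on a linked run queue, while an interrupted one has to drag the evaluation stack along as well.

/* Transputer-style scheduling state, sketched; not the real layout. */
#include <stdio.h>

typedef struct process {
    void           *wptr;   /* W: workspace pointer - the whole context */
    unsigned        iptr;   /* I: where to resume                       */
    struct process *next;   /* link word for the run queue              */
} process;

/* Round-robin run queue kept as front/back pointers. */
static process *front, *back;

static void schedule(process *p) {          /* add to back of queue */
    p->next = NULL;
    if (back) back->next = p; else front = p;
    back = p;
}

static process *run_next(void) {            /* take from front of queue */
    process *p = front;
    if (p) { front = p->next; if (!front) back = NULL; }
    return p;
}

/* An interrupted (rather than voluntarily descheduled) process needs
 * more state saved - the evaluation stack too: */
typedef struct {
    process base;
    int     Areg, Breg, Creg;   /* three-deep evaluation stack */
    /* ...plus remaining machine state (error flag etc.)       */
} interrupted_process;

int main(void) {
    process p1 = { (void *)0x8000, 0x100, NULL };
    process p2 = { (void *)0x8100, 0x200, NULL };
    schedule(&p1);
    schedule(&p2);
    for (process *p; (p = run_next()); )
        printf("resume at I=%#x with W=%p\n", p->iptr, p->wptr);
    return 0;
}

Descheduling is then a couple of stores and a link update, which is why it could be cheaper than a multiply.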

As for correctness checks - I presume you mean the bounds checks on arrays, etc - that was a feature of occam, not the transputer, which made doing it relatively easy but did not mandate it. And to be honest, it was not a bad decision in occam. So many of the virus hacks today would disappear if the software industry just went with mandated array bounds checks.
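For what it's worth, the check itself is trivial - something like this around each indexed access (plain C, just to show the shape; an occam compiler emits the equivalent for you):

/* A checked array access of the kind being discussed: the index is
 * validated before use instead of silently reading out of bounds. */
#include <stdio.h>
#include <stdlib.h>

static int checked_get(const int *a, size_t len, size_t i) {
    if (i >= len) {                  /* the bounds check */
        fprintf(stderr, "index %zu out of range (len %zu)\n", i, len);
        abort();                     /* occam/transputer: set the error flag */
    }
    return a[i];
}

int main(void) {
    int data[4] = { 10, 20, 30, 40 };
    printf("%d\n", checked_get(data, 4, 2));  /* fine    */
    printf("%d\n", checked_get(data, 4, 9));  /* trapped */
    return 0;
}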

And you can escape interrupts - at least in the form most processors have them. Whether it is practical to do so depends on the software structures you need to port (rather than reimplement).

Ruth's proposals seem to be focused on a different set of use cases than mine, so there is room in the universe for both of us ;-) GPUs show there is room on my side, and I have a notion that study of use cases will show there is lots of room out in embedded-style hundred-thousand-core-land.

I was suggesting lower numbers of CPUs because I was presuming FPGA implementation on devices that cost less than a family car. Of course more would be nice... but it is also true that I think you were aiming at a more minimal implementation - something very very close to a T425 or T800 on modern silicon.

While that does appeal, I think compared to modern processors it would be outclassed (even given greater numbers) rather quickly, because of the architectural and silicon improvements made since then, and because of Amdahl's law. That is, comparing 100,000 CPUs capable of 10 MIPS each (on aggregate over the whole program, not just their own flatline speed) to 2,000 CPUs capable of 1,000 MIPS each (again, on aggregate). The ever-present tension between faster and wider. Or, to put it another way, 'The Mythical Man-Month'.
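To put rough numbers on that tension (the 5% serial fraction below is purely an assumed figure, chosen only to show the shape of the effect, and the per-CPU rates are treated as peak rather than the aggregate figures above):

/* Back-of-envelope Amdahl comparison of the two configurations in the
 * paragraph above. The serial fraction s is an assumed illustration. */
#include <stdio.h>

static double amdahl_mips(double per_cpu_mips, double n_cpus, double s) {
    double t_parallel = (1.0 - s) / n_cpus;  /* time for the parallel part  */
    double t_total    = s + t_parallel;      /* normalised total time       */
    return per_cpu_mips / t_total;           /* effective aggregate rate    */
}

int main(void) {
    double s = 0.05;  /* assumed 5% serial fraction */
    printf("100000 x   10 MIPS -> %.0f effective MIPS\n",
           amdahl_mips(10.0, 100000.0, s));
    printf("  2000 x 1000 MIPS -> %.0f effective MIPS\n",
           amdahl_mips(1000.0, 2000.0, s));
    return 0;
}

Even a small serial fraction leaves the wide-and-slow array far behind the narrow-and-fast one, because the serial part only ever runs at single-CPU speed.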

I have myself wondered about massive arrays of small CPUs - I tend to think of 6502's - but experience tells me it quickly becomes very hard to use such a thing effectively, especially given the relatively small memory and I/O bandwidths available. The only places such arrays work well, that I know of, are the embarrassingly parallel ones, which is why this is exactly what happens in most GPUs. The basic GPU core is often a very small core, with extremely limited capability, but replicated thousands of times. In recent generations larger cores (that I mentioned earlier) are also used, which are more capable, but with fewer of them. In the graphics workflow, the basic cores are used for pixel colour and vertex calculations, while the larger cores are more texture based - that is, bigger picture stuff.

Another area of current interest that is embarrassingly parallel is of course neural networks/AI, which typically uses many thousands of nodes representing points on a decision tree or network. While some research groups are simulating such networks on arrays of small CPUs while trying to find good algorithms and network designs, there is a lot of effort being thrown into hard-coding simple algorithms into custom circuits that can be even more efficient and packed much more densely. I am no expert on such things, though.

Best wishes,

Ruth


--
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/


--
Roger Shepherd





Øyvind TEIG 
+47 959 615 06
oyvind.teig@xxxxxxxxxxx
https://www.teigfam.net/oyvind/home
(iMac)