Re: Transistor count
On 24/11/2020 00:19, Larry Dickson wrote:
The other side of the coin would be to make the link engine(s)
essentially all wormhole routers, as in the C104 router chips,
complete with packet addressing. Thus the link coprocessor would
essentially become some number of interfaces directly to the CPU plus
some number of interfaces to the external world, with a crossbar in
the middle. This would massively increase the communications
effectiveness of the design, and while taking up much more silicon
area, I believe it would be an overall benefit for any non-trivial
system. One net result is the elimination of the massive amount of
'receive and pass on' routing code that used to be needed with a
directly connected link design.
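As a sketch of that routing idea: the C104 selects an output link by
interval labelling, each link owning a contiguous range of header
values, and the rest of the packet is then streamed out behind the
header. Something along these lines, with all names and sizes
hypothetical:

    /* Illustrative only: interval-labelled output selection in the
     * style of the C104 wormhole router. */
    #define NUM_LINKS 32

    /* Headers in [separator[i], separator[i+1]) are routed to link i;
     * the table is filled in at network-configuration time. */
    static unsigned separator[NUM_LINKS + 1];

    int select_output_link(unsigned header)
    {
        for (int link = 0; link < NUM_LINKS; link++) {
            if (header >= separator[link] && header < separator[link + 1])
                return link;   /* packet body is wormholed out of this link */
        }
        return -1;             /* no interval claims this header */
    }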
An excellent point. But we need to remain use-case-sensitive. Some
physics is so simple that the main effort and its communications
might be standard enough that little such direction-mapping code
would be needed - and then the overhead of the wormhole stuff
could be a net negative.
For something simple and regular you're going to be much better off with
a GPU - their overwhelming strength is highly parallel tasks of that
sort. What I outlined was intended as a more generalised solution where
the CPUs are cooperating on a goal but definitely doing different things.
Different kinds of links/channels can branch out in even more
directions that have never been explored, like a hybrid between soft
and hard (NUMA). Anything that acts like a channel may be our friend.
I'm not following you in your comparison here ... link types and NUMA??
Perhaps you could elaborate.
The final element of the mix would be to engineer the system such
that software virtualisation of links was standard -- as was true on
the transputer -- so code could think just about performing
communication, not about which physical layer was involved, and also
a way for the link engine to raise an exception (e.g. sw interrupt)
to the processor if it cannot complete the communication 'instantly',
thus potentially requiring a thread to suspend or be released.
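To make that concrete, a user-level send on such a virtualised link
might look roughly like this - purely a sketch, with chan_t,
link_engine_try_send() and scheduler_suspend_on() as made-up names:

    #include <stddef.h>

    typedef struct chan chan_t;

    /* Hypothetical interface to the link engine and software scheduler. */
    extern int  link_engine_try_send(chan_t *c, const void *buf, size_t len);
    extern void scheduler_suspend_on(chan_t *c);

    void chan_send(chan_t *c, const void *buf, size_t len)
    {
        /* The caller neither knows nor cares whether c maps to a hardware
         * link, a crossbar port or a memory-to-memory channel. */
        if (link_engine_try_send(c, buf, len))
            return;                 /* completed 'instantly' */

        /* The link engine raised its cannot-complete exception: park this
         * thread until the communication finishes. */
        scheduler_suspend_on(c);
    }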
I don't know, but from what I have seen so far I don't think it is
worth the complexity and constraint of putting support for
interleaved threads into the processor hardware, as the Ts did, but I
do feel it is valuable for the hardware to provide appropriate hooks
for a lightweight threaded kernel to do the job efficiently.
I am not following you here. Where did this weigh heavily? It never
seemed much of a burden to me - much less than the burden of
supporting a kernel. Basically it's interrupts (done way more cleanly
than on other processors), a few words of process support, and rare,
simply implemented time-slicing. You cannot escape interrupts, and any
kernel I ever heard of is far more onerous than this (and has horrible
effects on the code design, by separating kernel from user code). What
burdened the Transputer was the standard correctness checks, but if
you want correct code . . . And even those could be streamlined.
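For reference, the 'few words of process support' is not an
exaggeration: a T-series process is identified simply by its workspace
pointer, with the saved instruction pointer at workspace word -1 and
the scheduling-list link at word -2, plus front/back queue registers
per priority. A rough C picture (illustrative only, not microcode):

    #include <stdint.h>

    typedef uintptr_t word;

    #define WS_IPTR  (-1)   /* saved instruction pointer */
    #define WS_LINK  (-2)   /* next workspace on the ready list */

    struct sched_q { word *front, *back; };   /* one queue per priority */

    /* Append a descheduled process - identified solely by its workspace
     * pointer - to the ready queue. */
    static void enqueue(struct sched_q *q, word *wptr, word resume_iptr)
    {
        wptr[WS_IPTR] = resume_iptr;
        wptr[WS_LINK] = 0;
        if (q->back)
            q->back[WS_LINK] = (word)wptr;    /* link old tail to new one */
        else
            q->front = wptr;
        q->back = wptr;
    }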
The issue with the transputer design was that it fixed the
implementation, so people who, for reasons they thought valid, needed
more levels of priority, had to jump through lots of hoops. Kernel
design in software is pretty simple and well researched now and still
there are many flavours of it. That is the reason I would personally
omit built-in scheduling, but provide whatever hooks were appropriate to
enable it in software.
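The sort of hook I mean is minimal: the hardware only signals that a
link has completed, and everything else - including how many priority
levels exist - stays a software decision. A sketch only; every name
here is hypothetical:

    typedef struct chan   chan_t;
    typedef struct thread { int priority; } thread_t;

    /* Hypothetical queries into the link engine and kernel structures. */
    extern chan_t   *link_event_source(void);
    extern thread_t *waiter_for(chan_t *c);
    extern void      ready_enqueue(thread_t *t, int priority);

    /* Installed by the software kernel as the link-event interrupt
     * handler: the hardware only says "this channel completed";
     * scheduling policy lives entirely in software. */
    void on_link_event(void)
    {
        chan_t   *c = link_event_source();  /* which channel completed? */
        thread_t *t = waiter_for(c);        /* the kernel's own bookkeeping */
        if (t)
            ready_enqueue(t, t->priority);  /* as many priorities as we like */
    }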
The other thing is that the transputer gained heavily from control over
the instruction set, both in being stack based, and in integrating
thread switch points into branches. While those could be reimplemented
it would change the ISA of the target CPU, rendering massive amounts of
software unusable. I would prefer to evolve a better solution.
As for correctness checks - I presume you mean the bounds checks on
arrays, etc - that was a feature of occam, not the transputer, which
made doing it relatively easy but did not mandate it. And to be honest,
it was not a bad decision in occam. So many of the virus hacks today
would disappear if the software industry just went with mandated array
bounds checks.
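By way of a toy example of what a mandated check buys you (illustrative
C, not occam):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 8
    static int table[N];

    /* The kind of check occam effectively gave you on every subscript. */
    static void checked_store(int i, int v)
    {
        if (i < 0 || i >= N) {
            fprintf(stderr, "bounds violation: index %d\n", i);
            abort();            /* fail loudly instead of corrupting memory */
        }
        table[i] = v;
    }

    int main(void)
    {
        checked_store(3, 42);   /* fine */
        checked_store(12, 99);  /* caught here, not exploited later */
        return 0;
    }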
Ruth's proposals seem to be focused on a different set of use cases
than mine, so there is room in the universe for both of us ;-) GPUs
show there is room on my side, and I have a notion that study of use
cases will show there is lots of room out in embedded-style
hundred-thousand-core-land.
I was suggesting lower numbers of CPUs because I was presuming FPGA
implementation on devices that cost less than a family car. Of course
more would be nice... but it is also true that I think you were aiming
at a more minimal implementation - something very very close to a T425
or T800 on modern silicon.
While that does appeal, I think compared to modern processors it would
be outclassed (even given greater numbers) rather quickly because of the
architectural and silicon improvements made since then, and because of
Amdahl's law. That is, compare 100,000 CPUs capable of 10 MIPS
each (in aggregate over the whole program, not just their own flatline
speed) with 2,000 CPUs capable of 1,000 MIPS each (again, in aggregate).
The ever-present tension between faster and wider. Or, to put it another
way, 'The Mythical Man-Month'.
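To put rough numbers on that (the 1% serial fraction is purely an
assumption for the sake of illustration):

    #include <stdio.h>

    /* Effective aggregate MIPS of n cores of per-core speed 'mips' when a
     * fraction s of the work cannot be parallelised (Amdahl's law). */
    static double aggregate_mips(double n, double mips, double s)
    {
        return n * mips / (s * n + (1.0 - s));
    }

    int main(void)
    {
        double s = 0.01;   /* assumed 1% serial work */
        printf("100,000 x 10 MIPS    -> ~%.0f MIPS effective\n",
               aggregate_mips(100000, 10, s));
        printf("  2,000 x 1,000 MIPS -> ~%.0f MIPS effective\n",
               aggregate_mips(2000, 1000, s));
        return 0;
    }

With even that small a serial fraction, the 2,000 fast cores come out
far ahead of the 100,000 slow ones.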
I have myself wondered about massive arrays of small CPUs - I tend to
think of 6502's - but experience tells me it quickly becomes very hard
to use such a thing effectively, especially given the relatively small
memory and I/O bandwidths available. The only place such arrays work
well, so far as I know, is on embarrassingly parallel problems, which is
why this is exactly what happens in most GPUs. The basic GPU
core is often a very small core, with extremely limited capability, but
replicated thousands of times. In recent generations larger cores (that
I mentioned earlier) are also used, which are more capable, but with
fewer of them. In the graphics workflow, the basic nodes are used for
pixel colour and vertex calculations, while the larger cores are more
texture based - that is, bigger picture stuff.
Another area of current interest that is embarrassingly parallel is of
course neural networks/AI, which typically uses many thousands of nodes
representing points on a decision tree or network. While some research
groups are simulating such networks on arrays of small CPUs as they try
to find good algorithms and network designs, a lot of effort is being
thrown into hard-coding simple algorithms into custom circuits that can
be even more efficient and packed much more densely. I am no
expert on such things, though.
Best wishes,
Ruth
--
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/