Dear Chris and others I remember from long ago,

Please excuse my interruption; I will have missed plenty of your interesting discussion. What strikes me is that latency is the bottleneck. Iann Barron commented at least ten years ago that he regretted that the transputer did not have a routing switch inside it for the links. It would have used more silicon, but a routing switch takes a great deal out of the software needed for the inevitable paths through so many cores.

I chose 4Links for my company's name, and we have made quite a few designs of 4-port routing switches, but I would like a minimum of 6 ports, and at one point we had a 16-port design, although we never used it.

If I've misunderstood, please ignore, but I got interested in communications links in the 1960s when I was designing core memories. I was always being asked for faster and faster memory, when the interface between processor and memory was a much longer delay than the memory itself.

Very best wishes to all

From: occam-com-request@xxxxxxxxxx <occam-com-request@xxxxxxxxxx>
On Behalf Of Jones, Chris C (UK Warton)

On Ruth's ideal transputer: I spend my life, or it seems a significant part of it, doing electromagnetic interaction simulations. We run on up to 2000 cores at a time and each simulation
takes from 3 to 15 days. I would like to run much faster and use many more cores, but the lack of sufficient cache RAM on the processors stops this. In fact, when I say 2000 cores, I mean I monopolise 2000 cores; I am actually running on half that or fewer.
When we benchmark the code (which only a couple of years ago was considered "embarrassingly parallel") we find that – running our problems – it runs faster up to around 1600 cores, but only with fewer and fewer cores per processor in operation. Above this the code runs more and more slowly until, at about 2000 cores, it is not worth adding more, and beyond this the performance drops catastrophically. The unused cores cannot be used by other processes without inflicting the same performance penalties. This appears to be almost entirely a limit on the cache memory available to each processor, indicating that 2MB per core is insufficient; the beneficial effect of more cache increases up to about 6 or 8MB, with a sweet spot around 4 to 5MB per core. I do not for one moment assume this is the same for all computing problems – indeed, I think it is very specific to particular codes running particular problems – but I do think it is illustrative of the problem, and of the immense importance of sufficient on-processor fast memory.
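A toy model can make this sort of behaviour concrete; every number in the sketch below is an assumed value chosen for illustration, not a measurement from the code in question:

/* Toy strong-scaling model: the per-core working set shrinks as cores
 * are added, but a miss penalty applies to whatever still spills out
 * of cache, and per-core overhead grows with core count.
 * All constants are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    const double problem_mb   = 16000.0;  /* assumed total working set, MB */
    const double cache_mb     = 2.0;      /* assumed per-core cache, MB    */
    const double t_compute    = 1.0;      /* arbitrary compute time units  */
    const double miss_penalty = 4.0;      /* slowdown when cache-starved   */
    const double t_comms      = 0.000001; /* per-core overhead units       */

    for (int cores = 250; cores <= 4000; cores += 250) {
        double per_core_mb = problem_mb / cores;
        /* fraction of the per-core working set that spills out of cache */
        double spill = per_core_mb > cache_mb
                     ? (per_core_mb - cache_mb) / per_core_mb : 0.0;
        double t = t_compute / cores * (1.0 + spill * miss_penalty)
                 + t_comms * cores;
        printf("%5d cores: relative time %.5f\n", cores, t);
    }
    return 0;
}

With these assumed constants the run time stops improving around a couple of thousand cores, and raising cache_mb moves that knee outward, which is the shape of the effect described above.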
I am looking forward to benchmarking the later AMD chips with much more cache. I also think that 4 links is not enough: 6 should be a minimum for 3D simulations, and 10-20 would be better for more up-to-date unstructured problems. …. Just one user's view.

Regards, Chris
Prof. Christopher C R Jones BSc. PhD C.Eng. FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information
Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome, Preston PR4 1AX
Direct: +44 (0) 3300 477425
Mobile: +44 (0)7855 393833
Fax: +44 (0)1772 855262
E-mail: chris.c.jones@xxxxxxxxxxxxxx
Web: www.baesystems.com
BAE Systems (Operations) Limited

From:
occam-com-request@xxxxxxxxxx [mailto:occam-com-request@xxxxxxxxxx]
On Behalf Of Ruth Ivimey-Cook
I've been following this discussion with great interest.

On 23/11/2020 21:43, Roger Shepherd wrote:
I agree that memory needs to be very local, or the processor will be starved of work. The harder problem is how much memory needs to be local.

Modern highly parallel processors are showing some interesting trends. Have a look, for example, at the latest NVidia GPU architecture -- it is definitely worth study. A bank of 256KB memory is supplied which can be partitioned between two purposes: either as local cache or as local storage. The local storage option then partakes of a global memory address map, where access to non-local addresses uses a comms architecture (not a shared bus). The amount of cache vs storage is configurable at run-time.

Another aspect of NVidia's design, and of more recent AMD designs, is of course the concept of core clusters, in which a group of 4-16 cores shares some resources. In an ideal world this would not be necessary, but physics tends to demand this sort of outcome, and it is probably worth investigating for more general purpose solutions.
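As a concrete illustration of that run-time split, the CUDA runtime exposes a C API for hinting how the configurable per-SM memory should be divided between L1 cache and shared ("local") storage; a minimal sketch, with the caveat that the exact sizes and the strength of the hint vary by GPU generation:

/* Sketch: asking the CUDA runtime to bias the configurable per-SM
 * memory towards shared storage rather than L1 cache. The effect of
 * this hint varies by GPU generation; this is purely illustrative. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    /* Prefer the "local storage" (shared memory) side of the split
     * for kernels launched after this call. Other options include
     * cudaFuncCachePreferL1 and cudaFuncCachePreferEqual. */
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    if (err != cudaSuccess) {
        fprintf(stderr, "cache config failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    puts("device set to prefer shared memory over L1 cache");
    return 0;
}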
One thing I have been contemplating for some time is what it would take to make a modern transputer. I feel the critical element of a design is provision of channel/link hardware that has minimal setup time and DMA-driven access. I feel that reducing latency, especially setup time, requires a coprocessor-like interface to the CPU, so that a single processor instruction can initiate comms. If hardware access were required over PCIe, for example, it would take hundreds of processor cycles. Pushing that work into a coprocessor enables the main CPU to get on with other things, and maximises the chance that the comms will happen as fast as possible.
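To make that concrete, here is a minimal sketch of what software might see: a memory-mapped descriptor plus a doorbell, so a send costs the CPU only a few stores. The register layout and base address are entirely hypothetical, invented for illustration:

/* Hypothetical link-coprocessor interface: the CPU fills in a DMA
 * descriptor and writes one doorbell register; the coprocessor does
 * the rest. Register layout and base address are invented here. */
#include <stdint.h>

struct link_desc {
    uint32_t dest;      /* virtual channel / destination id */
    uint32_t len;       /* bytes to transfer                */
    uint64_t buf;       /* physical address of the data     */
};

#define LINK_COPRO_BASE 0x40001000u          /* assumed MMIO window */
#define LINK_DESC  ((volatile struct link_desc *)(uintptr_t)LINK_COPRO_BASE)
#define LINK_GO    ((volatile uint32_t *)(uintptr_t)(LINK_COPRO_BASE + 0x10))

static inline void link_send(uint32_t dest, const void *buf, uint32_t len)
{
    LINK_DESC->dest = dest;
    LINK_DESC->len  = len;
    LINK_DESC->buf  = (uint64_t)(uintptr_t)buf;
    *LINK_GO = 1;   /* doorbell: comms proceed autonomously from here */
}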
The other side of the coin would be to make the link engine(s) essentially wormhole routers, as in the C104 router chip, complete with packet addressing. The link coprocessor would then become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while taking up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result is the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
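The addressing needed for that is small; something like the following hypothetical header, where the router inspects only a leading destination field and streams the rest of the packet through without software involvement. The field widths are invented, though the interval-labelled routing is the scheme the C104 actually used:

/* Hypothetical wormhole packet header, C104-like in spirit: the
 * router reads the destination field, sets up a crossbar path, and
 * forwards the payload. Field widths are invented for illustration. */
#include <stdint.h>

struct wormhole_hdr {
    uint16_t dest;      /* destination node label                  */
    uint8_t  vchan;     /* virtual channel on the destination node */
    uint8_t  flags;     /* e.g. end-of-message marker              */
};

/* Interval labelling, as used by the C104: each output port owns a
 * contiguous interval of destination labels, so routing is a scan. */
static int route_port(uint16_t dest, const uint16_t bounds[], int nports)
{
    for (int p = 0; p < nports; p++)
        if (dest < bounds[p])   /* first interval containing dest */
            return p;
    return -1;                  /* invalid destination */
}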
The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing communication, not about which physical layer was involved, and also to provide a way for the link engine to raise an exception (e.g. a software interrupt) to the processor if it cannot complete the communication 'instantly', potentially requiring a thread to suspend or be released. I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did, but I do feel it is valuable for the hardware to provide appropriate hooks for a light threaded kernel to do the job efficiently.
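A sketch of how that hook might look from the kernel's side; the status codes and kernel calls are hypothetical, just to show the shape of the interaction:

/* Hypothetical interaction between a virtual-channel send and a light
 * threaded kernel: if the link engine cannot complete "instantly" it
 * reports BUSY, and the kernel suspends the thread until the engine's
 * completion interrupt releases it. All names are invented. */
#include <stdint.h>

enum link_status { LINK_DONE, LINK_BUSY, LINK_ERROR };

/* Provided by the (hypothetical) hardware and kernel: */
extern enum link_status link_try_send(uint32_t vchan,
                                      const void *buf, uint32_t len);
extern void thread_suspend_on(uint32_t vchan);  /* woken by link IRQ */

/* What user code calls: a channel send that always "completes". */
int chan_send(uint32_t vchan, const void *buf, uint32_t len)
{
    switch (link_try_send(vchan, buf, len)) {
    case LINK_DONE:                 /* completed in a single attempt  */
        return 0;
    case LINK_BUSY:                 /* engine will finish via DMA;    */
        thread_suspend_on(vchan);   /* sleep until its interrupt      */
        return 0;
    default:
        return -1;                  /* e.g. bad channel               */
    }
}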
So, to be clear, my ideal 'modern transputer' would be:

- something similar to an ARM M4 CPU core at something like 500MHz, with its own L1 I and D cache, e.g. 8KB each;
- at least 256KB SRAM, partitionable into (shared I/D-cache) or local storage (and clusterable);
- a comms coprocessor capable of single-CPU-cycle comms setup and autonomous link operation for non-local packets, fitted with at least 4 external links and at least two direct-to-CPU links.

I would want to research further whether a fixed grid connection of these was adequate, or whether a more advanced option (that enabled some communications to travel further in one hop) was better. Similarly, selecting the most effective number of CPU links would need investigation (this number defines the max true parallelism of link comms to/from the processor).

Also:

- a core cluster approach to memory that enables sharing of 4 CPUs' storage memory (i.e. up to ~960KB directly addressable local RAM);
- an external memory interface permitting all core clusters to access an amount of off-chip memory at some (presumably slow) speed, all clusters seeing the memory at the same addresses.

I would hope to get 64 cores on a chip as 16 clusters, though that's probably impossible for current FPGA density because of the RAM...?

Regards

Ruth

--
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/