On Ruth’s ideal transputer: I spend my life, or what seems like a significant part of it, doing electromagnetic interaction simulations. We run on up to 2000 cores at a time and each simulation takes from 3 to 15 days. I would like to run much faster and use many more cores, but the lack of sufficient cache RAM on the processors prevents this. In fact, when I say 2000 cores, I mean I monopolise 2000 cores; I am actually running on half of them or fewer.

When we benchmark the code (which only a couple of years ago was considered “embarrassingly parallel”) we find that, running our problems, it gets faster up to around 1600 cores, but only with fewer and fewer cores per processor in operation. Above this the code runs more and more slowly until, at about 2000 cores, it is not worth adding more, and beyond this the performance drops catastrophically. The unused cores cannot be used by other processes without inflicting the same performance penalties. This appears to be almost entirely a limit on the cache memory available to each processor: 2MB per core is insufficient, the benefit of more cache keeps growing up to about 6 or 8MB per core, and the sweet spot is around 4 to 5MB.
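As a back-of-the-envelope illustration of why leaving cores idle helps here, the sketch below simply divides a processor's shared cache among the cores actually in use. The socket figures (64 cores sharing 128MB of last-level cache) are invented for illustration and are not taken from any actual benchmark hardware.

/* Hypothetical illustration: divide a socket's shared cache among the
 * cores actually in use. The 128MB / 64-core figures are invented. */
#include <stdio.h>

int main(void)
{
    const double l3_per_socket_mb = 128.0;  /* assumed shared cache per processor   */
    const int    cores_per_socket = 64;     /* assumed physical cores per processor */

    /* Leaving cores idle raises the cache available to each active core. */
    for (int active = cores_per_socket; active >= 8; active /= 2) {
        double mb_per_core = l3_per_socket_mb / active;
        printf("%2d of %d cores active: %5.1f MB cache per active core%s\n",
               active, cores_per_socket, mb_per_core,
               (mb_per_core >= 4.0 && mb_per_core <= 5.0) ? "  <- sweet spot" : "");
    }
    return 0;
}

With these made-up numbers, only a half-populated socket reaches the 4-5MB-per-core region described above, which is the same shape of trade-off as monopolising 2000 cores while computing on half of them.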
I do not for one moment assume this is the same for all computing problems; indeed, I think it is very specific to particular codes running particular problems. But I do think it is illustrative of the problem, and of the immense importance of sufficient on-processor fast memory. I am looking forward to benchmarking the later AMD chips with much more cache. I also think that 4 links is not enough: 6 should be a minimum for 3D simulations, and 10-20 would be better for more up-to-date unstructured problems. … Just one user’s view.

Regards, Chris
Prof. Christopher C R Jones BSc. PhD C.Eng. FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information

Direct: +44 (0) 3300 477425
Mobile: +44 (0)7855 393833
Fax: +44 (0)1772 855262
E-mail: chris.c.jones@xxxxxxxxxxxxxx
Web: www.baesystems.com

Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome, Preston PR4 1AX
BAE Systems (Operations) Limited

From: occam-com-request@xxxxxxxxxx [mailto:occam-com-request@xxxxxxxxxx] On Behalf Of Ruth Ivimey-Cook
I've been following this discussion with great interest.

On 23/11/2020 21:43, Roger Shepherd wrote:
I agree that memory needs to be very local, or the processor will be starved of work. The harder problem is how much memory needs to be local.

Modern highly parallel processors are showing some interesting trends. Have a look, for example, at the latest NVidia GPU architecture -- it is definitely worth studying. A bank of 256KB memory is supplied which can be partitioned between two purposes: either as local cache or as local storage. The local storage option then partakes of a global memory address map, where access to non-local addresses uses a comms architecture (not a shared bus). The amount of cache vs. storage is configurable at run-time.

Another aspect of NVidia's design, and of more recent AMD designs, is of course the concept of core clusters, in which a group of 4-16 cores share some resources. In an ideal world this would not be necessary, but physics tends to demand this sort of outcome, and it is probably worth investigating for more general-purpose solutions.
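The run-time cache/storage split described above can be requested from the host through the CUDA runtime API. The minimal sketch below (compile with nvcc) simply asks the device to prefer shared "local storage" over L1 cache; the exact sizes and the strength of the hint vary by GPU generation.

/* Host-side sketch: request the on-chip cache vs. shared-storage split.
 * Compile with nvcc; the effect of the hint differs across GPU generations. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Prefer more on-chip "local storage" (shared memory), less L1 cache. */
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    if (err != cudaSuccess) {
        fprintf(stderr, "cache config failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* On Volta and later the same request is usually made per kernel:
     * cudaFuncSetAttribute(kernel, cudaFuncAttributePreferredSharedMemoryCarveout, 75); */
    puts("requested shared-memory-preferred split");
    return 0;
}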
One thing I have been contemplating for some time is what it would take to make a modern transputer. I feel the critical element of such a design is the provision of channel/link hardware with minimal setup time and DMA-driven access. Reducing latency, especially setup time, requires a coprocessor-like interface to the CPU, so that a single processor instruction can initiate comms. If hardware access were required over PCIe, for example, it would take hundreds of processor cycles. Pushing that work into a coprocessor lets the main CPU get on with other things, and maximises the chance that the comms will happen as fast as possible.

The other side of the coin would be to make the link engine(s) essentially wormhole routers, as in the C104 router chips, complete with packet addressing. The link coprocessor would then become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while it would take up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result is the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
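To make "a single processor instruction can initiate comms" concrete, here is a purely hypothetical sketch of the CPU-facing side of such a link coprocessor; the base address, register layout and field names are all invented for illustration and do not describe a real device.

/* Hypothetical link-coprocessor interface: one store to the doorbell
 * register hands a descriptor to the DMA engine, and the on-chip wormhole
 * router forwards the packet, so no 'receive and pass on' code is needed. */
#include <stdint.h>

#define LINK_COPROC_BASE  0x40010000u            /* assumed MMIO window */

typedef volatile struct {
    uint32_t dest;      /* packet address: destination node / virtual channel */
    uint32_t src_ptr;   /* physical address of the message buffer             */
    uint32_t len;       /* message length in bytes                            */
    uint32_t doorbell;  /* writing here starts the transfer                   */
} link_chan_t;

#define LINK_CHAN(n) ((link_chan_t *)(LINK_COPROC_BASE + (n) * sizeof(link_chan_t)))

/* Initiate an output on virtual channel 'chan' and return immediately;
 * the coprocessor completes the communication autonomously. */
static inline void link_send(unsigned chan, uint32_t dest,
                             const void *buf, uint32_t len)
{
    link_chan_t *c = LINK_CHAN(chan);
    c->dest     = dest;
    c->src_ptr  = (uint32_t)(uintptr_t)buf;
    c->len      = len;
    c->doorbell = 1;    /* the single store that kicks the comms off */
}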
The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing communication, not about which physical layer was involved, and also to give the link engine a way to raise an exception (e.g. a software interrupt) to the processor if it cannot complete the communication 'instantly', potentially requiring a thread to suspend or be released. I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the transputers did; I do feel it is valuable for the hardware to provide appropriate hooks for a light threaded kernel to do the job efficiently.
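A hedged sketch of the software half of that arrangement, assuming a light threaded kernel: every name below (chan_try_send, sched_block_on, link_irq_handler and so on) is hypothetical and only shows where the hardware hook and the kernel would meet.

/* Hypothetical division of labour between the link engine and a light
 * threaded kernel; none of these functions is a real API. */
#include <stdbool.h>
#include <stdint.h>

struct thread;                                    /* kernel thread control block */
extern struct thread *current;                    /* thread running on this core */

/* Assumed link-coprocessor driver entry points. */
extern bool chan_try_send(int chan, const void *buf, uint32_t len); /* fast path */
extern int  chan_completed(void);                 /* which channel just finished */

/* Assumed threaded-kernel scheduler entry points. */
extern void sched_block_on(struct thread *t, int chan);  /* suspend until released  */
extern void sched_release(int chan);                     /* wake the waiting thread */

/* Blocking send as the application sees it: a context switch is only paid
 * for when the link engine cannot complete the communication 'instantly'. */
void chan_send(int chan, const void *buf, uint32_t len)
{
    if (chan_try_send(chan, buf, len))
        return;                        /* coprocessor accepted it; carry on    */
    sched_block_on(current, chan);     /* descheduled until the link IRQ fires */
}

/* Software interrupt raised by the link engine on completion. */
void link_irq_handler(void)
{
    sched_release(chan_completed());   /* make the suspended thread runnable again */
}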
So, to be clear, my ideal 'modern transputer' would be:

- something similar to an ARM M4 CPU core at something like 500MHz, with its own L1 I and D caches, e.g. 8KB each;
- at least 256KB of SRAM, partitionable into shared I/D-cache or local storage (and clusterable);
- a comms coprocessor capable of single-CPU-cycle comms setup and autonomous link operation for non-local packets, fitted with at least 4 external links and at least two direct-to-CPU links.

I would want to research further whether a fixed grid connection of these was adequate, or whether a more advanced option (one that enabled some communications to travel further in one hop) was better. Similarly, selecting the most effective number of CPU links would need investigation (this number defines the maximum true parallelism of link comms to/from the processor).

Also:

- a core-cluster approach to memory that enables 4 CPUs to share their storage memory (i.e. up to ~960KB of directly addressable local RAM);
- an external memory interface permitting all core clusters to access an amount of off-chip memory at some (presumably slow) speed, with all clusters seeing that memory at the same addresses.

I would hope to get 64 cores on a chip as 16 clusters, though that's probably impossible at current FPGA density because of the RAM...?

Regards,
Ruth

--
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/