Larry,

Thank you for taking an interest.

(1) Were you able to predict the drop-off at 2000, or did it come as a surprise?

Yes, I did predict the effect from extrapolations of benchmark data. In fact, my initial predictions were so catastrophic that my colleagues thought they would not be believed by our long-trousered brigade, and the slope of the decline was "tempered" to make it more believable – the trend was undeniable. The actual results were just as I had originally predicted.

(2) In either case, what was the cause (or at least the symptoms)?

The picture above is the "tempered" version, as I cannot find the original – I will keep searching – the actual declines were earlier and steeper. Unfortunately, the performance was somewhat short of what we were trying to achieve. (I have been campaigning for a long time to get our HPC procured against a performance target – we use iterations per hour of our code running our dataset – rather than a number of cores.)

(3) What do you mean by "hide the message passing"? Is it just that message passing threatens to overwhelm processing in amount of effort needed, or is it something else?

Our aim, since the code was originally written in occam, has been to maximise the efficiency of the numerical implementation and then to arrange the load on each processor so that each does, as near as possible, the same amount of work – which, optimally, would take just as long as the communications required between the processors. We could do this with occam message passing, but it has become not only more difficult with Fortran, C and C++, and with MPI, but also more difficult to see what you are doing and how. With modern processors, the other key ingredient seems to be cache RAM, which has become a bigger issue as processor performance has outstripped RAM performance by such a large degree. I am sure there are many other contributing issues too.
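A minimal sketch of the overlap technique in MPI terms (illustrative only – the 1-D halo exchange, array size and update stencil are assumptions for the example, not the code discussed here):

/* Hide the halo-exchange messages behind interior work using non-blocking MPI.
 * Assumes a 1-D domain split into slabs of NLOCAL cells, one ghost cell each end. */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 100000

static void compute_interior(double *u, int n) {   /* work needing no ghost cells */
    for (int i = 2; i < n; i++) u[i] += 0.25 * (u[i-1] + u[i+1]);
}
static void compute_edges(double *u, int n) {      /* work needing the ghost cells */
    u[1] += 0.25 * (u[0] + u[2]);
    u[n] += 0.25 * (u[n-1] + u[n+1]);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    int left  = (rank - 1 + nranks) % nranks;
    int right = (rank + 1) % nranks;

    double *u = calloc(NLOCAL + 2, sizeof(double)); /* u[0], u[NLOCAL+1] are ghosts */
    MPI_Request req[4];

    for (int step = 0; step < 100; step++) {
        /* 1. start the halo exchange ... */
        MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[NLOCAL+1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[NLOCAL],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* 2. ... do the bulk of the work while the messages are in flight ... */
        compute_interior(u, NLOCAL);

        /* 3. ... then finish the few cells that needed the ghost values. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        compute_edges(u, NLOCAL);
    }

    free(u);
    MPI_Finalize();
    return 0;
}

The point is that the communication cost is only "hidden" if step 2 takes at least as long as the messages take to arrive – exactly the balance described above.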
(4) Why do you run on just half the cores? Maybe I am being naive, but I would think if each core was assigned 1/2000 of the problem, they would all have to run.

Because the code runs significantly faster on partially used processors than on fully loaded ones. We think this is mostly to do with the cache RAM, which is available to all cores as they need it, so using fewer cores leaves more for the cores that are working. This is one of the reasons I want to get away from specifying a computing requirement as a number of cores. I do not really care how many cores I am using if I get the computing speed I need. (Well, I do actually, because no one else seems to bother about this issue. The people who buy the machines are much more concerned about the floor area, the electricity supply, the cooling and the fire suppression, and want to minimise them all. Satisfying the computing requirement is viewed as "covered" by a specification of the number of cores.)
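A back-of-the-envelope illustration of that cache argument – all the numbers below (32 MB of shared L3, 16 cores per socket, a 4 MB per-rank working set) are assumptions chosen for the example, not measurements of any particular machine:

/* Illustrative arithmetic only; cache size, core count and working set are assumed. */
#include <stdio.h>

int main(void) {
    const double l3_mb          = 32.0;     /* shared last-level cache per socket (assumed) */
    const double working_set_mb = 4.0;      /* per-rank working set of the solver (assumed) */
    const int    core_counts[]  = {16, 8};  /* fully populated vs half populated            */

    for (int i = 0; i < 2; i++) {
        int active = core_counts[i];
        double share = l3_mb / active;
        printf("%2d active cores: %.1f MB of L3 per rank -> working set %s\n",
               active, share,
               working_set_mb <= share ? "fits in cache" : "spills to DRAM");
    }
    return 0;
}

With those assumed numbers, halving the number of active cores is the difference between each rank working out of cache and each rank waiting on DRAM.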
(5) Are you seeing the floating point and denormalized number problems, or the problem of a slow node taking down the rest?

I do not think we are. In antenna modelling there are sometimes singularities which must be addressed specifically, but in Electromagnetic Hazards things are better behaved. We do have slower nodes, but we try hard to load-balance to minimise their impact. The tools for seeing how you are doing are pretty dreadful, so in recent development work we have been generating our own data, and load balancing, while we have to watch it, does not seem to be a big issue.

I hope these answers address your questions. Please ask again, as it is good to have someone taking an interest.

Regards,
Chris
Prof. Christopher C R Jones BSc. PhD C.Eng. FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information
Direct: +44 (0) 3300 477425   Mobile: +44 (0)7855 393833   Fax: +44 (0)1772 8 55262
E-mail: chris.c.jones@xxxxxxxxxxxxxx   Web: www.baesystems.com
Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome, Preston PR4 1AX
BAE Systems (Operations) Limited
What you say is extremely interesting, and I am listening intently, because you are one of the few that is still hammering away. (Except for GPU, apparently, and maybe AI.) But I still have a few questions.

(1) Were you able to predict the drop-off at 2000, or did it come as a surprise?

(2) In either case, what was the cause (or at least the symptoms)?

(3) What do you mean by "hide the message passing"? Is it just that message passing threatens to overwhelm processing in amount of effort needed, or is it something else?

(4) Why do you run on just half the cores? Maybe I am being naive, but I would think if each core was assigned 1/2000 of the problem, they would all have to run.

(5) Are you seeing the floating point and denormalized number problems, or the problem of a slow node taking down the rest?

Larry

On Dec 8, 2020, at 3:18 PM, Jones, Chris C (UK Warton) <chris.c.jones@xxxxxxxxxxxxxx> wrote:
Larry,

Not all of us have given up exactly, but there are so many barriers to achieving massive parallelism. I have to pack lots of processing into each core to hide the message passing, and then I need loads of cache RAM to make that amount of work efficient; that means I cannot efficiently make use of all the cores available on each processor – we run on just half most of the time and simply waste the rest.

Thus, for our types of problem, I can find the sweet spot in terms of the number of cores and processors for a particular case, which in most cases is between 500 and 1800 cores. Below the sweet spot the code runs linearly more slowly; at the sweet spot a single run might take up to 10 days or more; but above the sweet spot the code runs progressively more slowly, turning over remarkably quickly to not just diminishing returns but catastrophic slowing down, so that at about 2000 cores we might as well run on a big workstation. This means that I can increase throughput by running more cases simultaneously, provided I have many cases to run.
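One way to picture that sweet-spot behaviour is a toy timing model of the form T(N) = W/N + c1*N + c2*N^2, where the first term is the ideal speed-up and the other two stand in for communication and contention costs that grow with core count. The constants below are invented purely to put the minimum in the range described; this is an illustration, not a fit to the real benchmarks:

/* Toy model only: W, c1 and c2 are made-up constants chosen so that the
 * minimum falls near the 500-1800 core range described in the text. */
#include <stdio.h>

int main(void) {
    const double W  = 1.0e6;   /* total work (arbitrary units)           */
    const double c1 = 0.5;     /* per-core communication cost (assumed)  */
    const double c2 = 5.0e-4;  /* contention/imbalance cost (assumed)    */

    for (int n = 250; n <= 4000; n *= 2) {
        double t = W / n + c1 * n + c2 * (double)n * n;
        printf("%5d cores: modelled run time %8.1f\n", n, t);
    }
    return 0;
}

With these assumed constants the modelled time falls until roughly 1000 cores, then climbs steeply, so 4000 cores is far slower than 500 – the same qualitative shape as the drop-off described above.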
What I cannot do is run the problems faster, and that is what I need. I was offered a computer as big as a block of flats the other day (in jest, I should add), but I had to say that it would still not make the code run faster. That is the tragedy of present-day parallel computing and HPC. Is this why the term "supercomputing" seems to have been dropped? By the way, I am still running an "embarrassingly parallel" code.

Regards,
Chris

Prof. Christopher C R Jones BSc. PhD C.Eng. FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information
Direct: +44 (0) 3300 477425   Mobile: +44 (0)7855 393833   Fax: +44 (0)1772 8 55262
E-mail: chris.c.jones@xxxxxxxxxxxxxx   Web: www.baesystems.com
Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome, Preston PR4 1AX
BAE Systems (Operations) Limited

From: occam-com-request@xxxxxxxxxx [mailto:occam-com-request@xxxxxxxxxx] On Behalf Of Larry Dickson
I am trying to follow your discussion here, but let me cut to the chase: the implication is that communication setup time has stalled at about 1 us and has not improved since the Transputer. Measured in cycles, that seems to mean communication setup time has INCREASED from 20 cycles to about 2000 cycles.

This problem is not new. I remember one of the occam porting projects, to (I believe) the PowerPC (I can't find the reference), actually tackled external communication, but the overhead was about 700 cycles if I remember right. Fixing the problems (as David says) that slow you down by a factor of 100 certainly seems to make sense. I wonder what the physical limits are if you try to make a 2 GHz Transputer? Speed of light for 20 cycles would still give you 3 meters.
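The arithmetic behind those figures, spelled out (nothing here beyond the numbers already quoted):

/* Back-of-the-envelope numbers for a hypothetical 2 GHz Transputer. */
#include <stdio.h>

int main(void) {
    const double clock_hz      = 2.0e9;   /* 2 GHz clock                      */
    const double setup_seconds = 1.0e-6;  /* ~1 us communication setup today  */
    const double c_m_per_s     = 3.0e8;   /* speed of light                   */

    printf("1 us at 2 GHz       = %.0f cycles\n", setup_seconds * clock_hz);
    printf("20 cycles at 2 GHz  = %.0f ns\n", 20.0 / clock_hz * 1e9);
    printf("light travels about %.1f m in 20 cycles\n", c_m_per_s * 20.0 / clock_hz);
    return 0;
}

That is where the 2000 cycles and the 3 meters come from: 1 us is 2000 clock ticks at 2 GHz, while 20 ticks is 10 ns, in which light covers roughly 3 m.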
I know this is crude extrapolation, but . . . I wonder if the real problem is that people have given up on massively parallel programming (except for GPUs).

Larry

On Dec 6, 2020, at 1:42 PM, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:
Hi David,

Thanks for your thoughts.

It is in part the off-chip/on-chip disparity that makes me think that a co-processor style arrangement for comms is what is needed. As soon as the CPU has to go off-chip, all the latencies grow massively, so keep it local! Again, the need to flip to kernel mode and initialise DMA and interrupts makes me feel that a coprocessor that "understands" what is going on, and can therefore avoid needing to be told everything explicitly each time, is the right path. Ideally, the 'start communication' instruction would take as long to execute as a multiply, essentially transferring values from CPU registers to a queue on the hardware, with another instruction that executes a software trap if the operation hasn't completed (so the calling thread can wait on completion).
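To make that concrete, here is a hypothetical C-level sketch of such an interface. Every name in it is invented for illustration, and the "hardware" is mocked with an immediate copy so the example compiles and runs – no such instructions exist in current CPUs:

/* Hypothetical programming model only: comm_start()/comm_wait() stand in for a
 * single 'start communication' instruction and a 'wait' instruction that traps
 * if the transfer is still in flight.  The bodies are a trivial software mock. */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

typedef struct { const void *src; void *dst; size_t len; int done; } comm_handle_t;

/* Would be one instruction: move {src, dst, len} from registers to a hardware queue. */
static comm_handle_t comm_start(void *dst, const void *src, size_t len) {
    comm_handle_t h = { src, dst, len, 0 };
    memcpy(dst, src, len);   /* mock: real hardware would do this asynchronously */
    h.done = 1;
    return h;
}

/* Would be one instruction: trap to the scheduler if the transfer has not finished. */
static void comm_wait(comm_handle_t *h) {
    while (!h->done) { /* real version: software trap, calling thread descheduled */ }
}

int main(void) {
    double src[4] = {1, 2, 3, 4}, dst[4];

    comm_handle_t h = comm_start(dst, src, sizeof src); /* ideally as cheap as a multiply */
    double acc = 0;                                     /* overlap: useful work here      */
    for (int i = 0; i < 1000; i++) acc += i * 0.5;
    comm_wait(&h);                                      /* blocks (traps) only if needed  */

    printf("dst[3] = %g, overlapped work = %g\n", dst[3], acc);
    return 0;
}

The design point is that the descriptor hand-off stays entirely in user mode and on-chip; only the wait can involve the scheduler, and only when the transfer has genuinely not completed.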
The x86_64 CPUs from Intel and AMD both have very high speed inter-core interconnects, though I know little about them. They are apparently good for bus snooping and therefore for transferring short (cache-line = 64 .. 256 byte) blocks of data very quickly. It would be really interesting to know whether this hardware could be targeted by specialised processor instructions to make it useful for explicit communication, rather than implicit.

You're right about memory, of course, though of late the DDR line of memory has been improving faster than it did for quite a while (for about 10 years, performance improved by about 20%, but in the last 3 or 4 it has almost doubled). However, that doubling of performance is set against the 100- to 1000-fold performance differential to the CPU. Recent generations of memory interface have been trying to address this with wider memory buses – some Intel chips now have 4 parallel channels of DDR4 memory for a 1024-bit wide interface – but although this improves overall bus speed, it does little for the latency of any individual transfer. The famous "Spectre" (etc.) bugs in chip design come about mostly from the efforts to hide this latency from the core.

The only way I've seen of addressing the issue is to bring memory on-chip... which is what High Bandwidth Memory (HBM) does. It puts a large DRAM chip physically on top of the processor, with parallel interconnections all across the plane. Thus both a wide memory interface and a much higher speed (short & small wires) one can be achieved. The downside is that the size of memory is limited to what can fit in one chip, and it also limits processor power dissipation.

No magic bullets then. But it will take a lot of pressure to push the software industry to change its methods to enable the adoption of more parallel-friendly hardware, and I don't see that happening at present. The current pressures on programming are all aimed at reducing the cost to market, even at the expense of later maintenance. Achieved software quality is (IMO) low, and getting worse.

Best wishes,
Ruth

On 06/12/2020 18:17, David May wrote: