Larry,

Thank you for taking an interest.

(1) Were you able to predict the drop-off at 2000, or did it come as a surprise?

Yes, I did predict the effect from extrapolations of benchmark data. In fact, my initial predictions were so catastrophic that my colleagues thought they would not be believed by our long-trousered brigade, and the slope of the decline was "tempered" to make it more believable – the trend was undeniable. The actual results were just as I had originally predicted.

(2) In either case, what was the cause (or at least the symptoms)?

The picture above is the "tempered" version, as I cannot find the original – I will keep searching – the actual declines were earlier and steeper. Unfortunately, the performance was somewhat short of what we were trying to achieve. (I have been campaigning for a long time to get our HPC procured against a performance target – we use iterations per hour of our code running our dataset – rather than a number of cores.)

(3) What do you mean by "hide the message passing"? Is it just that message passing threatens to overwhelm processing in amount of effort needed, or is it something else?

Our aim, since the code was originally written in occam, has been to maximise the efficiency of the numerical implementation and then to arrange the load on each processor so that each does, as near as possible, the same amount of work – which, optimally, would take just as long as the communications required between the processors. We could do this with occam message passing, but it has become not only more difficult with Fortran, C and C++, and with MPI, but also more difficult to see what you are doing and how. With modern processors, the other key ingredient seems to be cache RAM, which has become a bigger issue as processor performance has outstripped RAM performance by such a large degree. I am sure there are many other contributing issues too.
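A minimal sketch of the overlap technique in MPI terms (illustrative only – the 1-D halo exchange, array size and update stencil are assumptions for the example, not the code discussed here):

/* Hide the halo-exchange messages behind interior work using non-blocking MPI.
 * Assumes a 1-D domain split into slabs of NLOCAL cells, one ghost cell each end. */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 100000

static void compute_interior(double *u, int n) {   /* work needing no ghost cells */
    for (int i = 2; i < n; i++) u[i] += 0.25 * (u[i-1] + u[i+1]);
}
static void compute_edges(double *u, int n) {      /* work needing the ghost cells */
    u[1] += 0.25 * (u[0] + u[2]);
    u[n] += 0.25 * (u[n-1] + u[n+1]);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    int left  = (rank - 1 + nranks) % nranks;
    int right = (rank + 1) % nranks;

    double *u = calloc(NLOCAL + 2, sizeof(double)); /* u[0], u[NLOCAL+1] are ghosts */
    MPI_Request req[4];

    for (int step = 0; step < 100; step++) {
        /* 1. start the halo exchange ... */
        MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[NLOCAL+1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[NLOCAL],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* 2. ... do the bulk of the work while the messages are in flight ... */
        compute_interior(u, NLOCAL);

        /* 3. ... then finish the few cells that needed the ghost values. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        compute_edges(u, NLOCAL);
    }

    free(u);
    MPI_Finalize();
    return 0;
}

The point is that the communication cost is only "hidden" if step 2 takes at least as long as the messages take to arrive – exactly the balance described above.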
(4) Why do you run on just half the cores? Maybe I am being naive, but I would think if each core was assigned 1/2000 of the problem, they would all have to run.

Because the code runs significantly faster on partially used processors than on fully loaded ones. We think this is mostly to do with the cache RAM, which is available to all cores as they need it, so using fewer cores leaves more for the cores that are working. This is one of the reasons I want to get away from specifying a computing requirement as a number of cores. I do not really care how many cores I am using if I get the computing speed I need. (Well, I do actually, because no one else seems to bother about this issue. The people who buy the machines are much more concerned about the floor area, the electricity supply, the cooling and the fire suppression, and want to minimise them all. Satisfying the computing requirement is viewed as "covered" by a specification of the number of cores.)
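A back-of-the-envelope illustration of that cache argument – all the numbers below (32 MB of shared L3, 16 cores per socket, a 4 MB per-rank working set) are assumptions chosen for the example, not measurements of any particular machine:

/* Illustrative arithmetic only; cache size, core count and working set are assumed. */
#include <stdio.h>

int main(void) {
    const double l3_mb          = 32.0;     /* shared last-level cache per socket (assumed) */
    const double working_set_mb = 4.0;      /* per-rank working set of the solver (assumed) */
    const int    core_counts[]  = {16, 8};  /* fully populated vs half populated            */

    for (int i = 0; i < 2; i++) {
        int active = core_counts[i];
        double share = l3_mb / active;
        printf("%2d active cores: %.1f MB of L3 per rank -> working set %s\n",
               active, share,
               working_set_mb <= share ? "fits in cache" : "spills to DRAM");
    }
    return 0;
}

With those assumed numbers, halving the number of active cores is the difference between each rank working out of cache and each rank waiting on DRAM.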
(5) Are you seeing the floating point and denormalized number problems, or the problem of a slow node taking down the rest?

I do not think we are. In antenna modelling there are sometimes singularities which must be addressed specifically, but in Electromagnetic Hazards things are better behaved. We do have slower nodes, but we try hard to load-balance to minimise their impact. The tools for seeing how you are doing are pretty dreadful, so in recent development work we have been generating our own data, and load balancing, while we have to watch it, does not seem to be a big issue.

I hope these answers address your questions. Please ask again, as it is good to have someone taking an interest.

Regards,
Chris
Prof. Christopher C R Jones BSc. PhD C.Eng. FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information
Direct: +44 (0) 3300 477425   Mobile: +44 (0)7855 393833   Fax: +44 (0)1772 8 55262
E-mail: chris.c.jones@xxxxxxxxxxxxxx   Web: www.baesystems.com
Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome, Preston PR4 1AX
BAE Systems (Operations) Limited
What you say is extremely interesting, and I am listening intently, because you are one of the few that is still hammering away. (Except for GPU, apparently, and maybe AI.) But I still have a few questions.

(1) Were you able to predict the drop-off at 2000, or did it come as a surprise?

(2) In either case, what was the cause (or at least the symptoms)?

(3) What do you mean by "hide the message passing"? Is it just that message passing threatens to overwhelm processing in amount of effort needed, or is it something else?

(4) Why do you run on just half the cores? Maybe I am being naive, but I would think if each core was assigned 1/2000 of the problem, they would all have to run.

(5) Are you seeing the floating point and denormalized number problems, or the problem of a slow node taking down the rest?

Larry

On Dec 8, 2020, at 3:18 PM, Jones, Chris C (UK Warton) <chris.c.jones@xxxxxxxxxxxxxx> wrote:
Larry,

Not all of us have given up exactly, but there are so many barriers to achieving massive parallelism. I have to pack lots of processing into each core to hide the message passing, and then I need loads of cache RAM to make that amount of work efficient; that means I cannot efficiently make use of all the cores available on each processor – we run on just half most of the time and simply waste the rest.

Thus, for our types of problem, I can find the sweet spot in terms of the number of cores and processors for a particular case, which in most cases is between 500 and 1800 cores. Below the sweet spot the code runs linearly more slowly; at the sweet spot a single run might take up to 10 days or more; but above the sweet spot the code runs progressively more slowly, turning over remarkably quickly to not just diminishing returns but catastrophic slowing down, so that at about 2000 cores we might as well run on a big workstation. This means that I can increase throughput by running more cases simultaneously, provided I have many cases to run.
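One way to picture that sweet-spot behaviour is a toy timing model of the form T(N) = W/N + c1*N + c2*N^2, where the first term is the ideal speed-up and the other two stand in for communication and contention costs that grow with core count. The constants below are invented purely to put the minimum in the range described; this is an illustration, not a fit to the real benchmarks:

/* Toy model only: W, c1 and c2 are made-up constants chosen so that the
 * minimum falls near the 500-1800 core range described in the text. */
#include <stdio.h>

int main(void) {
    const double W  = 1.0e6;   /* total work (arbitrary units)           */
    const double c1 = 0.5;     /* per-core communication cost (assumed)  */
    const double c2 = 5.0e-4;  /* contention/imbalance cost (assumed)    */

    for (int n = 250; n <= 4000; n *= 2) {
        double t = W / n + c1 * n + c2 * (double)n * n;
        printf("%5d cores: modelled run time %8.1f\n", n, t);
    }
    return 0;
}

With these assumed constants the modelled time falls until roughly 1000 cores, then climbs steeply, so 4000 cores is far slower than 500 – the same qualitative shape as the drop-off described above.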
What I cannot do is run the problems faster, and that is what I need. I was offered a computer as big as a block of flats the other day (in jest, I should add), but I had to say that it would still not make the code run faster. That is the tragedy of present-day parallel computing and HPC. Is this why the term "supercomputing" seems to have been dropped? By the way, I am still running an "embarrassingly parallel" code.

Regards,
Chris

Prof. Christopher C R Jones BSc. PhD C.Eng. FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information
Direct: +44 (0) 3300 477425   Mobile: +44 (0)7855 393833   Fax: +44 (0)1772 8 55262
E-mail: chris.c.jones@xxxxxxxxxxxxxx   Web: www.baesystems.com
Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome, Preston PR4 1AX
BAE Systems (Operations) Limited

From: occam-com-request@xxxxxxxxxx [mailto:occam-com-request@xxxxxxxxxx] On Behalf Of Larry Dickson
I am trying to follow your discussion here, but let me cut to the chase: the implication is that communication setup time has stalled at about 1 us and has not improved since the Transputer. Measured in cycles, that seems to mean communication setup time has INCREASED from 20 cycles to about 2000 cycles.

This problem is not new. I remember one of the occam porting projects, to (I believe) the PowerPC (I can't find the reference), actually tackled external communication, but the overhead was about 700 cycles if I remember right. Fixing the problems (as David says) that slow you down by a factor of 100 certainly seems to make sense. I wonder what the physical limits are if you try to make a 2 GHz Transputer? Speed of light for 20 cycles would still give you 3 meters.
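The arithmetic behind those figures, spelled out (nothing here beyond the numbers already quoted):

/* Back-of-the-envelope numbers for a hypothetical 2 GHz Transputer. */
#include <stdio.h>

int main(void) {
    const double clock_hz      = 2.0e9;   /* 2 GHz clock                      */
    const double setup_seconds = 1.0e-6;  /* ~1 us communication setup today  */
    const double c_m_per_s     = 3.0e8;   /* speed of light                   */

    printf("1 us at 2 GHz       = %.0f cycles\n", setup_seconds * clock_hz);
    printf("20 cycles at 2 GHz  = %.0f ns\n", 20.0 / clock_hz * 1e9);
    printf("light travels about %.1f m in 20 cycles\n", c_m_per_s * 20.0 / clock_hz);
    return 0;
}

That is where the 2000 cycles and the 3 meters come from: 1 us is 2000 clock ticks at 2 GHz, while 20 ticks is 10 ns, in which light covers roughly 3 m.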
I know this is crude extrapolation, but . . . I wonder if the real problem is that people have given up on massively parallel programming (except for GPUs).

Larry

On Dec 6, 2020, at 1:42 PM, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:
Hi David,

Thanks for your thoughts.

It is in part the off-chip/on-chip disparity that makes me think that a co-processor style arrangement for comms is what is needed. As soon as the CPU has to go off-chip, all the latencies grow massively, so keep it local! Again, the need to flip to kernel mode and initialise DMA and interrupts makes me feel that a coprocessor that "understands" what is going on, and can therefore avoid needing to be told everything explicitly each time, is the right path. Ideally, the 'start communication' instruction would take as long to execute as a multiply, essentially transferring values from CPU registers to a queue on the hardware, with another instruction that executes a software trap if the operation hasn't completed (so the calling thread can wait on completion).
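To make that concrete, here is a hypothetical C-level sketch of such an interface. Every name in it is invented for illustration, and the "hardware" is mocked with an immediate copy so the example compiles and runs – no such instructions exist in current CPUs:

/* Hypothetical programming model only: comm_start()/comm_wait() stand in for a
 * single 'start communication' instruction and a 'wait' instruction that traps
 * if the transfer is still in flight.  The bodies are a trivial software mock. */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

typedef struct { const void *src; void *dst; size_t len; int done; } comm_handle_t;

/* Would be one instruction: move {src, dst, len} from registers to a hardware queue. */
static comm_handle_t comm_start(void *dst, const void *src, size_t len) {
    comm_handle_t h = { src, dst, len, 0 };
    memcpy(dst, src, len);   /* mock: real hardware would do this asynchronously */
    h.done = 1;
    return h;
}

/* Would be one instruction: trap to the scheduler if the transfer has not finished. */
static void comm_wait(comm_handle_t *h) {
    while (!h->done) { /* real version: software trap, calling thread descheduled */ }
}

int main(void) {
    double src[4] = {1, 2, 3, 4}, dst[4];

    comm_handle_t h = comm_start(dst, src, sizeof src); /* ideally as cheap as a multiply */
    double acc = 0;                                     /* overlap: useful work here      */
    for (int i = 0; i < 1000; i++) acc += i * 0.5;
    comm_wait(&h);                                      /* blocks (traps) only if needed  */

    printf("dst[3] = %g, overlapped work = %g\n", dst[3], acc);
    return 0;
}

The design point is that the descriptor hand-off stays entirely in user mode and on-chip; only the wait can involve the scheduler, and only when the transfer has genuinely not completed.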
The x86_64 CPUs from Intel and AMD both have very high speed inter-core interconnects, though I know little about them. They are apparently good for bus snooping and therefore for transferring short (cache-line = 64 .. 256 byte) blocks of data very quickly. It would be really interesting to know whether this hardware could be targeted by specialised processor instructions to make it useful for explicit communication, rather than implicit.

You're right about memory, of course, though of late the DDR line of memory has been improving faster than it did for quite a while (for about 10 years, performance improved by about 20%, but in the last 3 or 4 it has almost doubled). However, that doubling of performance is set against the 100- to 1000-fold performance differential to the CPU. Recent generations of memory interface have been trying to address this with wider memory buses – some Intel chips now have 4 parallel channels of DDR4 memory for a 1024-bit wide interface – but although this improves overall bus speed, it does little for the latency of any individual transfer. The famous "Spectre" (etc.) bugs in chip design come about mostly from the efforts to hide this latency from the core.

The only way I've seen of addressing the issue is to bring memory on-chip... which is what High Bandwidth Memory (HBM) does. It puts a large DRAM chip physically on top of the processor, with parallel interconnections all across the plane. Thus both a wide memory interface and a much higher speed (short & small wires) one can be achieved. The downside is that the size of memory is limited to what can fit in one chip, and it also limits processor power dissipation.

No magic bullets then. But it will take a lot of pressure to push the software industry to change its methods to enable the adoption of more parallel-friendly hardware, and I don't see that happening at present. The current pressures on programming are all aimed at reducing the cost to market, even at the expense of later maintenance. Achieved software quality is (IMO) low, and getting worse.

Best wishes,
Ruth

On 06/12/2020 18:17, David May wrote: