Hi all,
Denis's contribution is so big that I am mostly just going to salute it rather than itemize a response. I am replying here to his initial essay, without neglecting his and others' continuations on the thread, so that the essay itself is easy to refer back to. These threads (Transistor count and Scientific processors) ought to be published as a book ;-)
Denis pretty much confirms that most of the floating point is Linpack-like (multiply-accumulate with indexing and grid peculiarities), but then focuses on inefficiencies and errors due to roundoff (and its cousin, underflow/denormalization). I believe this is a problem that is susceptible to theoretical investigation, based on use cases. Computational models should be required to be insensitive to roundoff and underflow (except where the science says the real system amplifies the effects of small differences).
When I worked on fluid flow, I was immediately drawn to conservative expressions of the equations, due to spurious sources and sinks caused by roundoff and approximation error. I think this stability principle is extremely important - so much so that I propose this rule of thumb: If the problem can be well-posed from a flow and stability point of view, even if the discretization is too crude, the result will tend to be qualitatively correct. This tends to hold even for "bursting strength" types of problems which have discontinuous solutions.
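To make the flux-form idea concrete, here is a minimal toy sketch (my own illustration, not code from any real model) of a conservative update for 1-D advection. Because every interface flux is subtracted from one cell and added to its neighbour, the scheme cannot manufacture or destroy the advected quantity in the interior, whatever the truncation error; the total is preserved to roundoff.

  /* Toy conservative (flux-form) update for 1-D advection with a
     first-order upwind flux on a periodic domain. */
  #include <stdio.h>

  #define N 100

  int main(void)
  {
      double q[N], flux[N + 1];
      const double u = 1.0, dx = 1.0, dt = 0.5;  /* CFL = u*dt/dx = 0.5 */

      for (int i = 0; i < N; i++)                /* a simple initial blob */
          q[i] = (i > 40 && i < 60) ? 1.0 : 0.0;

      for (int step = 0; step < 50; step++) {
          /* Upwind flux at each cell interface (u > 0); interfaces 0
             and N coincide because the domain is periodic. */
          for (int i = 0; i <= N; i++)
              flux[i] = u * q[(i - 1 + N) % N];

          /* Flux-form update: what leaves cell i-1 enters cell i. */
          for (int i = 0; i < N; i++)
              q[i] -= dt / dx * (flux[i + 1] - flux[i]);
      }

      double total = 0.0;
      for (int i = 0; i < N; i++) total += q[i];
      printf("total after 50 steps = %.15f (initial total 19)\n", total);
      return 0;
  }

A non-conservative form of the same update can leak or create mass cell by cell, which is exactly the spurious source-and-sink problem above.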
The multiple time/length scale problem can often be attacked by solving several different kinds of approximation and allowing them to feed back into each other. That requires more theoretical study, which climate models in particular need.
Here's another thought: imagine an idealized analog computer. Is your solution approach a stable approximation to this? An analog computer treats underflow right, losing it in the noise, and also "breaks" when small differences really multiply, as in failure of materials.
Larry
Hi All,
After I stopped working on Transputer-based machines, I ran a High
Performance Computing Initiative centre supporting mainly
natural environment modelling, particularly global ocean
circulation. I also ran the benchmarking for two generations of UK
national research supercomputers. All this was nearly twenty years
ago, but some knowledge from that generation may still be useful.
Here goes:

1. Traditional scientific computing is completely dominated by
linear algebra. Supercomputer vendors love to sell their wares
using Linpack, which solves a dense system of linear equations.
Real scientific computing is, however, dominated by the solving of
Partial Differential Equations, either in "real" space, or
in "Fourier" space (the so-called spectral methods). Real space
solvers work on a sparse matrix; spectral solvers do a
lot of fast Fourier transforms. So, almost all the actual
calculation can be done well using fused multiply-accumulate
instructions; it can also often be organised to work well with
short vector operations. Depending on what is being solved, there
can be a lot of "housekeeping", e.g. because the FFT is on a
sphere (global ocean), because there is some sort of model-defined
mesh (e.g. for engineered structures), or because the computation
has to iterate over different length scales (multigrid methods).
Within a compute node, the trick is to control the housekeeping so
the FPUs can let rip. (A small sketch of such an inner loop
appears after point 8.)

2. Integer multiply used to be important for indexing into
multi-dimensional structures. As I remember it, the IBM RS/6000
was the first microprocessor to have strength-reducing
compilers so effective that the integer multiplier could
be made slower and cheaper. I guess, however, that integer
multiplier size does not matter so much nowadays.

3. In most systems, the real load is taken by the floating point
units. The IEEE standard is important here for several
reasons.
- Floating multiply is relatively easy as the mantissa is
shorter than for integers and there is not much normalisation to
do. In contrast, addition can require extensive shifting and
exponent correction. Division is a pain but is rarely necessary;
after all, any complex expression can easily be converted into
one with only a single division, e.g. a/b + c/d = (a*d + c*b)/(b*d).
Floating point makes the
necessary scaling relatively easy.
- Support for denormalised numbers can be important for
precision, for stability, and for "compliance". Would you
believe a climate prediction performed on a machine whose
arithmetic does not conform to the established standard? It used
to be routine for the denormalised number handling to run
through a slow microcoded path. That can be a performance
disaster. Consider, for example, an infinite impulse
response filter in which the key operation is
    y[n+1] = y[n] * 0.8 + x[n] * 0.2
If the input x goes to zero, y will decay into a small
denorm. number and settle there; it will never fall to zero. If
this were a filter on an audio input, all would be well while
you kept talking but, when you turned off the mic., the filter
would run much slower. On an Opteron, this could be a
thousand times slower. The difficulty is compounded for vector
operations, as in a multimedia unit, where the whole
vector is slowed down by one denorm. (A small demonstration of
the decay into denormals follows after these points.)
- Floating point arithmetic is famously not associative:
(a + b) + c and a + (b + c) can round differently. This
heavily restricts the optimisations which can be performed while
retaining bit-for-bit identical results. You either accept that
the answers can change, write your code very carefully to
pre-implement the optimisations, or go slow.
- Neither integer nor floating-point arithmetic is "complete";
there are overflow and divide-by-zero exceptions. What are we
going to do about them? Actually throwing an exception is a
performance nightmare. In the FPU, we can use the various
infinities and NaNs to keep the numbers flowing through, but are
NaNs on the fast path? Or are they also microcoded?
- Given that most of what we will actually do will be
multiply-accumulates, we might think about using a full-length
accumulator for floating addition; we run the accumulator as a
very long scaled integer. That will be more "accurate" and more
associative, but is not very "standard", although efforts have
been made.
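To illustrate that last point, here is a toy version of the scaled-integer accumulator (my sketch, not any standard facility). It assumes the inputs are bounded in magnitude and quantises them onto a fixed 2^-40 grid; within those limits the additions are exact integer additions, so the sum no longer depends on the order of the terms. A real long accumulator (Kulisch style) covers the whole IEEE exponent range instead of needing such assumptions.

  /* Toy "long accumulator": sum doubles as scaled integers.
     Assumes |x| <= 2^20 and quantises inputs to a 2^-40 grid;
     __int128 is a GCC/Clang extension. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double xs[] = { 0x1p20, 0x1p-40, -0x1p20, 0x1p-40 };
      int n = sizeof xs / sizeof xs[0];

      /* Plain double summation: the small terms vanish under the
         big ones, and the result depends on the order. */
      double d = 0.0;
      for (int i = 0; i < n; i++) d += xs[i];

      /* Scaled-integer accumulator: each input becomes an integer
         multiple of 2^-40; 128 bits of width means billions of
         terms cannot overflow. */
      __int128 acc = 0;
      for (int i = 0; i < n; i++)
          acc += (__int128)llround(scalbn(xs[i], 40));  /* x * 2^40 */

      printf("double sum         = %.17g\n", d);
      printf("scaled-integer sum = %.17g\n", scalbn((double)acc, -40));
      return 0;
  }

The double loop returns 2^-40; the scaled-integer loop returns the exact 2^-39, and would do so whatever order the terms arrive in.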
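And here is the denormal slowdown from the filter example above, as a small self-contained test (again my sketch; the size of the effect is machine and compiler dependent, and it disappears if you compile with flush-to-zero or fast-math options).

  /* First-order IIR filter y = 0.8*y + 0.2*x.  With the input held
     at zero, y decays into the denormal range within a few thousand
     steps and stays there, so almost every remaining iteration does
     denormal arithmetic. */
  #include <stdio.h>
  #include <time.h>

  static double run(double x, long iters)
  {
      double y = 1.0;
      for (long i = 0; i < iters; i++)
          y = y * 0.8 + x * 0.2;          /* the filter update */
      return y;
  }

  int main(void)
  {
      const long iters = 100000000L;      /* 1e8 updates */

      clock_t t0 = clock();
      double y1 = run(1.0, iters);        /* signal on: y stays normal */
      clock_t t1 = clock();
      double y2 = run(0.0, iters);        /* signal off: y goes denormal */
      clock_t t2 = clock();

      printf("x = 1: y = %g, %.2f s\n", y1,
             (double)(t1 - t0) / CLOCKS_PER_SEC);
      printf("x = 0: y = %g, %.2f s\n", y2,
             (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }

On cores that handle denormals in microcode, the second call is dramatically slower than the first even though it is doing "nothing"; y settles on the smallest denormal and never reaches zero, just as described above.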

4. Getting bit-for-bit matching answers from consecutive runs is
really difficult. Obviously, we need to seed our PRNGs
consistently, but there are all sorts of internal code and
optimisation issues that break things; changing the number of
processors, for instance, changes the order of a global sum and
hence its rounding. This leads to real difficulty in verifying
benchmark tests. Overlaid on this are
sometimes genuine instabilities in important scientific codes. For
example, global ocean models can be very hard to "spin up"; you
need exactly the right initial conditions or the circulation never
converges to that of the planet we know. This may not even be a
problem in the models; perhaps the real equations of motion have
multiple fixed points? There are similar difficulties in molecular
dynamics around hydrogen bonding. Sadly, that is the case we
care about most; it covers protein folding in hydrated systems.

5. Some important computations depend on long-range interactions.
I have already mentioned multigrid methods, but there are
real physical systems that depend on long range "forces". Two that
spring to mind are Fermion calculations, where the wave
functions have to be anti-symmetric (you can't put two electrons
in the same state), and meteorology, where vertical radiative heat
transfer between layers is important.

6. Various sorts of multiphysics calculations depend on
coupled interactions between very different length or time scales.
Coupled ocean and climate models are a good example; the deep
oceans provide a much longer term store of heat, salinity
layering, and CO2 than the atmosphere. This all
complicates the housekeeping.

7. Most science needs to use lockstep "{calculate communicate}
repeat" computation, perhaps with overlapping and some red-black
cleverness. This requires the computational step to take the same
time on every processor. The FPU issues in "3" can cause a
problem; other difficulties can arise when cache aliasing is worse
in one particular processor. A famous historic example involved a
global virtual address space shared over a large parallel
computer. On exactly one processor, the code for the inner loop
aliased on top of the region of data it was using; that one
processor ran slowly and held up the whole system.
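For what it is worth, the skeleton of that lockstep pattern looks something like the sketch below. The decomposition, array sizes and the toy smoothing step are invented for illustration; only the structure matters. The halo exchange cannot complete until the slowest rank has finished its calculate step, so one rank hitting denormals or bad cache aliasing stalls all of them.

  /* "{calculate communicate} repeat" with a 1-D domain decomposition
     and halo exchange between neighbouring ranks. */
  #include <mpi.h>
  #include <stdio.h>

  #define NLOCAL 1000                 /* interior points per rank */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int left  = (rank - 1 + size) % size;   /* periodic neighbours */
      int right = (rank + 1) % size;

      double u[NLOCAL + 2], v[NLOCAL + 2];    /* [0], [NLOCAL+1] are halos */
      for (int i = 0; i < NLOCAL + 2; i++) u[i] = (double)rank;

      for (int step = 0; step < 100; step++) {
          /* Calculate: a toy three-point smoothing of the interior. */
          for (int i = 1; i <= NLOCAL; i++)
              v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
          for (int i = 1; i <= NLOCAL; i++)
              u[i] = v[i];

          /* Communicate: swap halo points with both neighbours.
             Everyone waits here for the slowest rank. */
          MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                       &u[0],      1, MPI_DOUBLE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                       &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      if (rank == 0) printf("done: u[1] = %g\n", u[1]);
      MPI_Finalize();
      return 0;
  }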

8. So, it's quite hard to get an interesting model to work at
all. Things get even harder when we add constraints such as a high
degree of parallelism or limited local memory. It is entirely
likely that an established model will have just a few parameters
that can be "tweaked" for various architectures. These typically
allow for cache sizes, for the number of processors, for the
length of MPI (message passing interface) communications,
but not much else. If you design something radical that requires
substantial porting of applications, it won't get used. In that
context, I see that a lot of GPUs are finding their way into
established supercomputers, but I do not have any clear data about
their effectiveness.
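Finally, to make points 1 and 2 concrete, here is the kind of inner loop I have in mind. It is a sketch of my own, not code from any real model: a five-point operator applied over a flattened 2-D grid. The arithmetic is a short chain of multiply-adds that a compiler can fuse into FMA instructions; the i*NY + j address arithmetic is the integer housekeeping that strength reduction turns into pointer increments.

  #include <stdio.h>

  #define NX 64
  #define NY 64

  /* v = A*u for a five-point operator; u and v are flat arrays with
     element (i,j) at index i*NY + j. */
  static void apply(const double *u, double *v,
                    double c0, double c1, double c2)
  {
      for (int i = 1; i < NX - 1; i++)
          for (int j = 1; j < NY - 1; j++) {
              int k = i * NY + j;   /* the index multiply of point 2 */
              v[k] = c0 * u[k]
                   + c1 * (u[k - NY] + u[k + NY])
                   + c2 * (u[k - 1] + u[k + 1]);
          }
  }

  int main(void)
  {
      static double u[NX * NY], v[NX * NY];
      for (int i = 0; i < NX * NY; i++) u[i] = 1.0;

      /* A discrete Laplacian-like operator: applied to a constant
         field it should give zero in the interior. */
      apply(u, v, -4.0, 1.0, 1.0);
      printf("v at the centre = %g (expect 0)\n",
             v[(NX / 2) * NY + NY / 2]);
      return 0;
  }
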
Best wishes,
Denis Nicole