Hi all,
Denis's contribution is so big that I am mostly just going to salute it rather than itemize a response. I am replying here to his initial essay, without neglecting his and others' continuations on the thread, so that the essay itself is easy to refer back to. These threads (Transistor count and Scientific processors) ought to be published as a book ;-)
Denis pretty much confirms that most of the floating point is Linpack-like (multiply-accumulate with indexing and grid peculiarities), but then focuses on inefficiencies and errors due to roundoff (and its cousin, underflow/denormalization). I believe this is a problem that is susceptible to theoretical investigation, based on use cases. Computational models should be required to be insensitive to roundoff and underflow (except where the science says the real system amplifies the effects of small differences).
When I worked on fluid flow, I was immediately drawn to conservative expressions of the equations, due to spurious sources and sinks caused by roundoff and approximation error. I think this stability principle is extremely important - so much so that I propose this rule of thumb: If the problem can be well-posed from a flow and stability point of view, even if the discretization is too crude, the result will tend to be qualitatively correct. This tends to hold even for "bursting strength" types of problems which have discontinuous solutions.
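To make the flux-form idea concrete, here is a minimal toy sketch (my own illustration, not code from any real model) of a conservative update for 1-D advection. Because every interface flux is subtracted from one cell and added to its neighbour, the scheme cannot manufacture or destroy the advected quantity in the interior, whatever the truncation error; the total is preserved to roundoff.

  /* Toy conservative (flux-form) update for 1-D advection with a
     first-order upwind flux on a periodic domain. */
  #include <stdio.h>

  #define N 100

  int main(void)
  {
      double q[N], flux[N + 1];
      const double u = 1.0, dx = 1.0, dt = 0.5;  /* CFL = u*dt/dx = 0.5 */

      for (int i = 0; i < N; i++)                /* a simple initial blob */
          q[i] = (i > 40 && i < 60) ? 1.0 : 0.0;

      for (int step = 0; step < 50; step++) {
          /* Upwind flux at each cell interface (u > 0); interfaces 0
             and N coincide because the domain is periodic. */
          for (int i = 0; i <= N; i++)
              flux[i] = u * q[(i - 1 + N) % N];

          /* Flux-form update: what leaves cell i-1 enters cell i. */
          for (int i = 0; i < N; i++)
              q[i] -= dt / dx * (flux[i + 1] - flux[i]);
      }

      double total = 0.0;
      for (int i = 0; i < N; i++) total += q[i];
      printf("total after 50 steps = %.15f (initial total 19)\n", total);
      return 0;
  }

A non-conservative form of the same update can leak or create mass cell by cell, which is exactly the spurious source-and-sink problem above.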
The multiple time/length scale problem can often be attacked by solving several different kinds of approximation and allowing them to feed back into each other. That requires more theoretical study, which climate models in particular need.
Here's another thought: imagine an idealized analog computer. Is your solution approach a stable approximation to this? An analog computer treats underflow right, losing it in the noise, and also "breaks" when small differences really multiply, as in failure of materials.
Larry
Hi All,
After I stopped working on Transputer-based machines, I ran a High
Performance Computing Initiative centre supporting mainly
natural environment modelling, particularly global ocean
circulation. I also ran the benchmarking for two generations of UK
national research supercomputers. All this was nearly twenty years
ago, but some knowledge from that generation may still be useful.
Here goes:

1. Traditional scientific computing is completely dominated by
linear algebra. Supercomputer vendors love to sell their wares
using Linpack, which solves a dense system of linear equations.
Real scientific computing is, however, dominated by the solving of
Partial Differential Equations, either in "real" space, or
in "Fourier" space (the so-called spectral methods). Real space
solvers work on a sparse matrix; spectral solvers do a
lot of fast Fourier transforms. So, almost all the actual
calculation can be done well using fused multiply-accumulate
instructions; it can also often be organised to work well with
short vector operations. Depending on what is being solved, there
can be a lot of "housekeeping", e.g. because the FFT is on a
sphere (global ocean), because there is some sort of model-defined
mesh (e.g. for engineered structures), or because the computation
has to iterate over different length scales (multigrid methods).
Within a compute node, the trick is to control the housekeeping so
the FPUs can let rip. (A small sketch of such an inner loop
appears after point 8.)

2. Integer multiply used to be important for indexing into
multi-dimensional structures. As I remember it, the IBM RS/6000
was the first microprocessor to have strength-reducing
compilers so effective that the integer multiplier could
be made slower and cheaper. I guess, however, that integer
multiplier size does not matter so much nowadays.

3. In most systems, the real load is taken by the floating point
units. The IEEE standard is important here for several
reasons.
- Floating multiply is relatively easy as the mantissa is
shorter than for integers and there is not much normalisation to
do. In contrast, addition can require extensive shifting and
exponent correction. Division is a pain but is rarely necessary;
after all, any complex expression can easily be converted into
one with only a single division, e.g. a/b + c/d = (a*d + c*b)/(b*d).
Floating point makes the
necessary scaling relatively easy.
- Support for denormalised numbers can be important for
precision, for stability, and for "compliance". Would you
believe a climate prediction performed on a machine whose
arithmetic does not conform to the established standard? It used
to be routine for the denormalised number handling to run
through a slow microcoded path. That can be a performance
disaster. Consider, for example, an infinite impulse
response filter in which the key operation is
    y[n+1] = y[n] * 0.8 + x[n] * 0.2
If the input x goes to zero, y will decay into a small
denorm. number and settle there; it will never fall to zero. If
this were a filter on an audio input, all would be well while
you kept talking but, when you turned off the mic., the filter
would run much slower. On an Opteron, this could be a
thousand times slower. The difficulty is compounded for vector
operations, as in a multimedia unit, where the whole
vector is slowed down by one denorm. (A small demonstration of
the decay into denormals follows after these points.)
- Floating point arithmetic is famously not associative:
(a + b) + c and a + (b + c) can round differently. This
heavily restricts the optimisations which can be performed while
retaining bit-for-bit identical results. You either accept that
the answers can change, write your code very carefully to
pre-implement the optimisations, or go slow.
- Neither integer nor floating-point arithmetic is "complete";
there are overflow and divide-by-zero exceptions. What are we
going to do about them? Actually throwing an exception is a
performance nightmare. In the FPU, we can use the various
infinities and NaNs to keep the numbers flowing through, but are
NaNs on the fast path? Or are they also microcoded?
- Given that most of what we will actually do will be
multiply-accumulates, we might think about using a full-length
accumulator for floating addition; we run the accumulator as a
very long scaled integer. That will be more "accurate" and more
associative, but is not very "standard", although efforts have
been made.
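To illustrate that last point, here is a toy version of the scaled-integer accumulator (my sketch, not any standard facility). It assumes the inputs are bounded in magnitude and quantises them onto a fixed 2^-40 grid; within those limits the additions are exact integer additions, so the sum no longer depends on the order of the terms. A real long accumulator (Kulisch style) covers the whole IEEE exponent range instead of needing such assumptions.

  /* Toy "long accumulator": sum doubles as scaled integers.
     Assumes |x| <= 2^20 and quantises inputs to a 2^-40 grid;
     __int128 is a GCC/Clang extension. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double xs[] = { 0x1p20, 0x1p-40, -0x1p20, 0x1p-40 };
      int n = sizeof xs / sizeof xs[0];

      /* Plain double summation: the small terms vanish under the
         big ones, and the result depends on the order. */
      double d = 0.0;
      for (int i = 0; i < n; i++) d += xs[i];

      /* Scaled-integer accumulator: each input becomes an integer
         multiple of 2^-40; 128 bits of width means billions of
         terms cannot overflow. */
      __int128 acc = 0;
      for (int i = 0; i < n; i++)
          acc += (__int128)llround(scalbn(xs[i], 40));  /* x * 2^40 */

      printf("double sum         = %.17g\n", d);
      printf("scaled-integer sum = %.17g\n", scalbn((double)acc, -40));
      return 0;
  }

The double loop returns 2^-40; the scaled-integer loop returns the exact 2^-39, and would do so whatever order the terms arrive in.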
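And here is the denormal slowdown from the filter example above, as a small self-contained test (again my sketch; the size of the effect is machine and compiler dependent, and it disappears if you compile with flush-to-zero or fast-math options).

  /* First-order IIR filter y = 0.8*y + 0.2*x.  With the input held
     at zero, y decays into the denormal range within a few thousand
     steps and stays there, so almost every remaining iteration does
     denormal arithmetic. */
  #include <stdio.h>
  #include <time.h>

  static double run(double x, long iters)
  {
      double y = 1.0;
      for (long i = 0; i < iters; i++)
          y = y * 0.8 + x * 0.2;          /* the filter update */
      return y;
  }

  int main(void)
  {
      const long iters = 100000000L;      /* 1e8 updates */

      clock_t t0 = clock();
      double y1 = run(1.0, iters);        /* signal on: y stays normal */
      clock_t t1 = clock();
      double y2 = run(0.0, iters);        /* signal off: y goes denormal */
      clock_t t2 = clock();

      printf("x = 1: y = %g, %.2f s\n", y1,
             (double)(t1 - t0) / CLOCKS_PER_SEC);
      printf("x = 0: y = %g, %.2f s\n", y2,
             (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }

On cores that handle denormals in microcode, the second call is dramatically slower than the first even though it is doing "nothing"; y settles on the smallest denormal and never reaches zero, just as described above.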

4. Getting bit-for-bit matching answers from consecutive runs is
really difficult. Obviously, we need to seed our PRNGs
consistently, but there are all sorts of internal code and
optimisation issues that break things; changing the number of
processors, for instance, changes the order of a global sum and
hence its rounding. This leads to real difficulty in verifying
benchmark tests. Overlaid on this are
sometimes genuine instabilities in important scientific codes. For
example, global ocean models can be very hard to "spin up"; you
need exactly the right initial conditions or the circulation never
converges to that of the planet we know. This may not even be a
problem in the models; perhaps the real equations of motion have
multiple fixed points? There are similar difficulties in molecular
dynamics around hydrogen bonding. Sadly, that is the case we
care about most; it covers protein folding in hydrated systems.

5. Some important computations depend on long-range interactions.
I have already mentioned multigrid methods, but there are
real physical systems that depend on long range "forces". Two that
spring to mind are Fermion calculations, where the wave
functions have to be anti-symmetric (you can't put two electrons
in the same state), and meteorology, where vertical radiative heat
transfer between layers is important.

6. Various sorts of multiphysics calculations depend on
coupled interactions between very different length or time scales.
Coupled ocean and climate models are a good example; the deep
oceans provide a much longer term store of heat, salinity
layering, and CO2 than the atmosphere. This all
complicates the housekeeping.

7. Most science needs to use lockstep "{calculate communicate}
repeat" computation, perhaps with overlapping and some red-black
cleverness. This requires the computational step to take the same
time on every processor. The FPU issues in "3" can cause a
problem; other difficulties can arise when cache aliasing is worse
in one particular processor. A famous historic example involved a
global virtual address space shared over a large parallel
computer. On exactly one processor, the code for the inner loop
aliased on top of the region of data it was using; that one
processor ran slowly and held up the whole system.
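For what it is worth, the skeleton of that lockstep pattern looks something like the sketch below. The decomposition, array sizes and the toy smoothing step are invented for illustration; only the structure matters. The halo exchange cannot complete until the slowest rank has finished its calculate step, so one rank hitting denormals or bad cache aliasing stalls all of them.

  /* "{calculate communicate} repeat" with a 1-D domain decomposition
     and halo exchange between neighbouring ranks. */
  #include <mpi.h>
  #include <stdio.h>

  #define NLOCAL 1000                 /* interior points per rank */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int left  = (rank - 1 + size) % size;   /* periodic neighbours */
      int right = (rank + 1) % size;

      double u[NLOCAL + 2], v[NLOCAL + 2];    /* [0], [NLOCAL+1] are halos */
      for (int i = 0; i < NLOCAL + 2; i++) u[i] = (double)rank;

      for (int step = 0; step < 100; step++) {
          /* Calculate: a toy three-point smoothing of the interior. */
          for (int i = 1; i <= NLOCAL; i++)
              v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
          for (int i = 1; i <= NLOCAL; i++)
              u[i] = v[i];

          /* Communicate: swap halo points with both neighbours.
             Everyone waits here for the slowest rank. */
          MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                       &u[0],      1, MPI_DOUBLE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                       &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      if (rank == 0) printf("done: u[1] = %g\n", u[1]);
      MPI_Finalize();
      return 0;
  }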

8. So, it's quite hard to get an interesting model to work at
all. Things get even harder when we add constraints such as a high
degree of parallelism or limited local memory. It is entirely
likely that an established model will have just a few parameters
that can be "tweaked" for various architectures. These typically
allow for cache sizes, for the number of processors, for the
length of MPI (message passing interface) communications,
but not much else. If you design something radical that requires
substantial porting of applications, it won't get used. In that
context, I see that a lot of GPUs are finding their way into
established supercomputers, but I do not have any clear data about
their effectiveness.
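Finally, to make points 1 and 2 concrete, here is the kind of inner loop I have in mind. It is a sketch of my own, not code from any real model: a five-point operator applied over a flattened 2-D grid. The arithmetic is a short chain of multiply-adds that a compiler can fuse into FMA instructions; the i*NY + j address arithmetic is the integer housekeeping that strength reduction turns into pointer increments.

  #include <stdio.h>

  #define NX 64
  #define NY 64

  /* v = A*u for a five-point operator; u and v are flat arrays with
     element (i,j) at index i*NY + j. */
  static void apply(const double *u, double *v,
                    double c0, double c1, double c2)
  {
      for (int i = 1; i < NX - 1; i++)
          for (int j = 1; j < NY - 1; j++) {
              int k = i * NY + j;   /* the index multiply of point 2 */
              v[k] = c0 * u[k]
                   + c1 * (u[k - NY] + u[k + NY])
                   + c2 * (u[k - 1] + u[k + 1]);
          }
  }

  int main(void)
  {
      static double u[NX * NY], v[NX * NY];
      for (int i = 0; i < NX * NY; i++) u[i] = 1.0;

      /* A discrete Laplacian-like operator: applied to a constant
         field it should give zero in the interior. */
      apply(u, v, -4.0, 1.0, 1.0);
      printf("v at the centre = %g (expect 0)\n",
             v[(NX / 2) * NY + NY / 2]);
      return 0;
  }
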
Best wishes,
Denis Nicole