
Re: Scientific processors



Hi Roger,

Good to hear from you.

Yes, non-associativity is a real problem, but I think most of the practical issues are local to a processor. In practice, if we are doing a distributed "reduce", it is quite likely that the MPI code in the application will impose a deterministic ordering. 
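The ordering sensitivity is easy to reproduce in miniature: summing the same data in one pass versus two partial sums (as a two-way reduce would) already changes the last bit. A Python sketch, with illustrative data:

```python
x = [0.1] * 10  # illustrative data; anything that rounds will do

# One sequential sum.
total = 0.0
for xi in x:
    total += xi

# Two partial sums combined at the end, as a two-way "reduce" would do it.
half = len(x) // 2
first = 0.0
for xi in x[:half]:
    first += xi
second = 0.0
for xi in x[half:]:
    second += xi
split = first + second

print(total)           # 0.9999999999999999
print(split)           # 1.0
print(total == split)  # False
```

Neither answer is wrong; they are just different roundings of the same exact sum, which is why a deterministic reduce ordering matters for reproducibility.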

On the other hand, we can encourage the compiler to reorder FP operations with, for example, the GCC option -funsafe-math-optimizations. That may make bad things happen, but they will happen more quickly.
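Even before parallelism, simply reassociating three addends changes the rounding. In Python (which evaluates exactly as written, in IEEE doubles):

```python
# The same three addends, grouped two ways: the intermediate roundings
# differ, so the final results differ in the last place.
left = (0.1 + 0.2) + 0.3   # 0.1 + 0.2 rounds to 0.30000000000000004
right = 0.1 + (0.2 + 0.3)  # 0.2 + 0.3 rounds to 0.5 exactly
print(left)   # 0.6000000000000001
print(right)  # 0.6
```

This regrouping is exactly the kind of transformation such flags license the compiler to perform.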

We might hope that the developers of scientific codes are full-on numerical analysts and know whether their particular calculations are stable in the presence of weakening of the IEEE 754 guarantees. In other words, that the final answer is within the error bounds given in the program's specification. Don't count on it. I fear what is more likely is that a few sample runs are made at double or quad precision and, if the answer does not change much, all is assumed to be OK.

My experience was even worse than this. We sometimes simply could not get bit-exact results out of the same code on machines from two different vendors. An obvious, and quite likely, reason is that different stack layouts caused improperly initialised local variables to take on different values. But nothing showed up at compile time, and the results were not "too far" off. Scientific production code is more characterised by urgency than by formal correctness, and checks are likely to be based on reasonable consistency with previous versions.

Remember the old broken FORTRAN scalar product (of three-dimensional vectors) example that may still be lurking:

      DO 10 I=1. 3
      A(I) = B(I) * C (I)
10    CONTINUE

Even now, it's not obvious how we could find the problem by asking a model checker to verify that the code is SO(3) invariant. How about the crystallographic point group m3m? I can actually see how to check that but, in an odd corner of the code, the error might go unnoticed for years.
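Short of a model checker, one can at least probe the invariance numerically: apply random rotations to the inputs and check that the output is unchanged up to rounding. A sketch in Python/NumPy (the helper names and the deliberately broken candidate are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_rotation():
    # QR decomposition of a random Gaussian matrix gives a random
    # orthogonal matrix; fix signs and the determinant so it lies in SO(3).
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def dot3(b, c):            # the intended scalar product
    return b[0]*c[0] + b[1]*c[1] + b[2]*c[2]

def broken_dot3(b, c):     # a bug of the DO-loop kind: only one term survives
    return b[0]*c[0]

def max_invariance_error(f, trials=100):
    # Worst-case |f(Rb, Rc) - f(b, c)| over random rotations R.
    worst = 0.0
    for _ in range(trials):
        b, c = rng.normal(size=3), rng.normal(size=3)
        R = random_rotation()
        worst = max(worst, abs(f(R @ b, R @ c) - f(b, c)))
    return worst

print(max_invariance_error(dot3))        # tiny: invariant up to rounding
print(max_invariance_error(broken_dot3)) # order 1: not invariant at all
```

A randomised test like this would catch the gross error, though of course it proves nothing about the odd corner that is never exercised.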

While we are descending into obscurity, can I draw attention to some work by my ex-colleague Pawel Sobocinski on Graphical Linear Algebra: https://graphicallinearalgebra.net/. See episode 26. It turns out that, if you give up a little associativity and a little commutativity, you can make rational arithmetic complete. You just need three extra numbers:
∞    ⊤     ⊥
That's far more parsimonious than IEEE 754. The intuition is that the first is infinity; the others correspond to "no value" and "any value".

NB: To guarantee SO(3) invariance, the obvious approach would be to show that all the coordinate arithmetic can be rewritten in terms of the invariant tensors δij and εijk. The code above can't be rewritten in that way, at least not if you have good eyesight.
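For what that rewriting looks like in practice, here is a sketch in Python/NumPy (variable names mine): the scalar product expressed through δij and the cross product through εijk, checked against the usual componentwise results.

```python
import numpy as np

delta = np.eye(3)                    # Kronecker delta δ_ij

eps = np.zeros((3, 3, 3))            # Levi-Civita symbol ε_ijk
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k] = 1.0               # even permutations
    eps[i, k, j] = -1.0              # odd permutations

b = np.array([1.0, 2.0, 3.0])
c = np.array([4.0, 5.0, 6.0])

dot = np.einsum('ij,i,j->', delta, b, c)     # δ_ij b_i c_j = b · c
cross = np.einsum('ijk,j,k->i', eps, b, c)   # ε_ijk b_j c_k = (b × c)_i

print(dot)    # 32.0
print(cross)  # [-3.  6. -3.]
```

Anything built only from δ and ε this way is manifestly invariant; the mistyped DO loop is not expressible in that form.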

Best wishes,
Denis

On 25/11/2020 10:42, Roger Shepherd wrote:
Denis, 

Regarding bit-exactness.

On 25 Nov 2020, at 09:28, Denis A Nicole <dan@xxxxxxxxxxxxxxx> wrote:

3. In most systems, the real load is taken by the floating point units.  The IEEE standard is important here for several reasons.

  1. Floating point arithmetic is famously not associative. This heavily restricts the optimisations which can be performed while retaining bit-for-bit identical results. You either accept that the answers can change, write your code very carefully to pre-implement the optimisations, or go slow.
Non-associativity is a problem. It’s why these two pieces of code can give different results for the value of sum:

sum = 0.0
for i = 0 for x.size
    sum = sum + x[i]

and

sumFirstPart = 0.0
for i = 0 for x.size/2
    sumFirstPart = sumFirstPart + x[i]

sumSecondPart = 0.0
for i = x.size/2 for (x.size - x.size/2)
    sumSecondPart = sumSecondPart + x[i]

sum = sumFirstPart + sumSecondPart

The second code segment shows exactly what you want to do to perform the two partial sums in parallel. I believe that in some real-world systems the decisions about parallelisation are made at run-time, depending on what the computer is doing at the time, and so different runs on identical data give rise to different results.


4. Getting bit-for-bit matching answers from consecutive runs is really difficult. Obviously, we need to seed our PRNGs consistently, but there are all sorts of internal code and optimisation issues that break things. This leads to real difficulty in verifying benchmark tests. Overlaid on this are sometimes genuine instabilities in important scientific codes. For example, global ocean models can be very hard to "spin up"; you need exactly the right initial conditions or the circulation never converges to that of the planet we know. This may not even be a problem in the models; perhaps the real equations of motion have multiple fixed points? There are similar difficulties in molecular dynamics around hydrogen bonding. Sadly, that is the case we care about most; it covers protein folding in hydrated systems.


The non-associativity of f.p. arithmetic is the cause of many problems. Is the repeatability problem you mention due to effects other than this?

Roger