Denis,
Regarding bit-exactness:

3. In most systems, the real load is carried by the floating-point
units. The IEEE 754 standard is important here for several
reasons.
- Floating-point arithmetic is famously not associative. This
heavily restricts the optimisations that can be performed while
retaining bit-for-bit identical results. You either accept that
the answers can change, write your code very carefully to
pre-implement the optimisations, or go slow.
Non-associativity is a problem. It's why these two pieces of code can give different results for the value sum:

    sum = 0
    for i = 0 to x.size - 1
        sum = sum + x[i]

    sumFirstPart = 0
    for i = 0 to x.size/2 - 1
        sumFirstPart = sumFirstPart + x[i]
    sumSecondPart = 0
    for i = x.size/2 to x.size - 1
        sumSecondPart = sumSecondPart + x[i]
    sum = sumFirstPart + sumSecondPart
The second code segment is exactly what you want in order to perform the two partial sums in parallel. I believe that in some real-world systems the decisions about parallelisation are made at run time, depending on what the computer is doing at the moment, and so different runs on identical data can give rise to different results.
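To make this concrete, here is a minimal Python sketch (my own example, not taken from any particular system) showing that the sequential sum and the two-partial-sums version really do disagree on the same data:

```python
# Ten copies of 0.1: each addition incurs a rounding error, and the
# accumulated error depends on the order of the additions.
x = [0.1] * 10

# Sequential left-to-right sum, as the first code segment does.
seq = 0.0
for v in x:
    seq = seq + v

# Two partial sums combined, as the second (parallelisable) segment does.
half = len(x) // 2
par = sum(x[:half]) + sum(x[half:])

print(seq)         # 0.9999999999999999
print(par)         # 1.0
print(seq == par)  # False: same data, different order, different bits
```

The discrepancy is only one unit in the last place here, but in a long reduction such differences compound, which is why reordered parallel reductions cannot in general be bit-exact.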
4. Getting bit-for-bit matching answers from consecutive runs is really difficult. Obviously, we need to seed our PRNGs consistently, but all sorts of internal code and optimisation issues can break things, and this makes verifying benchmark tests genuinely hard.

Overlaid on this are sometimes genuine instabilities in important scientific codes. For example, global ocean models can be very hard to "spin up": you need exactly the right initial conditions or the circulation never converges to that of the planet we know. This may not even be a defect in the models; perhaps the real equations of motion have multiple fixed points. There are similar difficulties in molecular dynamics around hydrogen bonding. Sadly, that is the case we care about most, since it covers protein folding in hydrated systems.
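On the PRNG point, the seeding discipline is at least easy to state in code. A sketch using Python's standard random module (the workload here is a made-up stand-in): give each run an explicitly seeded private generator rather than relying on global state, so the seed is the only thing you need to record to reproduce the stream:

```python
import random

def simulate(seed):
    # A private, explicitly seeded generator: the same seed always
    # yields the same pseudo-random stream, run after run.
    rng = random.Random(seed)
    # Stand-in workload: sum of 1000 pseudo-random draws.
    return sum(rng.random() for _ in range(1000))

# Identical seeds give identical results across consecutive runs...
assert simulate(42) == simulate(42)
# ...but, as above, this only holds if the summation order and the
# rest of the code path are also identical between runs.
```

Of course, as the preceding point shows, consistent seeding is necessary but not sufficient: any reordering of the floating-point work downstream of the generator can still change the bits.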
The non-associativity of f.p. arithmetic is the cause of many problems. Is the repeatability problem you mention due to effects other than this?
Roger