
RE: Scientific processors



This email may contain proprietary information of BAE Systems and/or third parties.

 

Denis,

 

Our simulation codes (still mostly written in Fortran, with some parts in C) do not produce anything like the same numbers on different hardware, as we find out each time we benchmark replacement HPC machines.  In double precision we see variations from the 5th or 6th decimal place.  We do not worry about that so long as the error gets no bigger, but we do run to tens of millions of iterations to confirm this.  Generally, these codes all run using MPI (though in various vendor versions), and mostly with the same compiler.

 

Also, the problems are large and complex: a whole aircraft resolved down to 1 or 2 cm, in which the cables are about 10 times more complex than the structure of the aircraft (you do mention the complexity of the problem).  To add a certain piquancy, we are doing electromagnetic interaction modelling, which requires complex numbers and Maxwell’s coupled PDEs, the famous curl equations.  So we do not assume correctness.  We validate the code (that it models the physics correctly) and we validate the model (the representation of the aircraft) by simulating tested conditions and aiming to reproduce the test measurement results as closely as we can.  This is seriously non-trivial, but the unavoidable errors in that last step are, and must be, far greater than discrepancies in the 5th significant digit.

 

Currently we use a numerical method called Transmission Line Matrix, or Transmission Line Method (TLM), on a rigid Cartesian mesh.  It is provably stable, as opposed to Finite Difference, which will eventually go unstable.  We actually use a uniform mesh through the problem geometry and out into white space because the code runs this so much more efficiently.  We are working on unstructured codes with body-fitted cells and “true-route” representation of cables, initially for much finer modelling of bits of structure, meshing down to microns where necessary.  That’s why I am interested in the possibilities of more links.

 

So, yes, we do accept these small errors, but we do make sure they are small.  We run several real test cases for which we have the best measurement data we can get.  We run them many times over on various numbers of cores, from tens to thousands.  We also run the codes using various numbers of cores per processor, as we find that the amount of available cache makes a huge difference to compute performance (see my paper at CPA 2013).  Dissimilarity in the numbers at the limit of precision is worrying in so far as it might be the tip of a scary iceberg, but the comparisons we eventually make will be in terms of dB, not nth significant digits.
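
To put the two scales side by side, here is a rough sketch of what a disagreement in the 5th or 6th significant digit amounts to in dB. It assumes the compared quantities are field amplitudes (hence the 20*log10 convention); the function name and the 1e-5 figure are illustrative only and are not taken from our codes.

```python
import math

def amplitude_ratio_to_db(ratio):
    """Express a ratio of field amplitudes in decibels (20*log10 convention)."""
    return 20.0 * math.log10(ratio)

# Assumed relative disagreement between two machines: one part in 1e5,
# i.e. a difference appearing around the 5th or 6th significant digit.
relative_discrepancy = 1e-5

# Ratio between the two results is (1 + relative_discrepancy).
discrepancy_db = amplitude_ratio_to_db(1.0 + relative_discrepancy)

print(f"{discrepancy_db:.2e} dB")
```

The result is of the order of 1e-4 dB, far below anything a measurement comparison would resolve.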

 

Is this “real” engineering?

 

Regards,

Chris

 

PS.  I still yearn for a usable occam that I could use to create a solver giving me a combination of functional and domain decomposition.  And a Transputer …..

 

 

Prof. Christopher C R Jones BSc. PhD C.Eng. FIET

BAE Systems Engineering Fellow

EMP Fellow of the Summa Foundation

Principal Technologist – Electromagnetics

Military Air & Information
Electromagnetic Engineering, W423A
Engineering Integrated Solutions
Warton Aerodrome
Preston
PR4 1AX

Direct:  +44 (0) 3300 477425
Mobile:  +44 (0)7855 393833
Fax:     +44 (0)1772 8 55262
E-mail:  chris.c.jones@xxxxxxxxxxxxxx
Web:     www.baesystems.com

 

BAE Systems (Operations) Limited
Registered Office: Warwick House, PO Box 87, Farnborough Aerospace Centre, Farnborough, Hants, GU14 6YU, UK
Registered in England & Wales No: 1996687

Exported from the United Kingdom under the terms of the UK Export Control Act 2002 (DEAL No 8106)

 

From: occam-com-request@xxxxxxxxxx [mailto:occam-com-request@xxxxxxxxxx] On Behalf Of Denis A Nicole
Sent: 25 November 2020 12:25
To: Roger Shepherd
Cc: Larry Dickson; Øyvind Teig; Tony Gore; Ruth Ivimey-Cook; occam-com; Uwe Mielke; David May; Michael Bruestle; Claus Peter Meder
Subject: Re: Scientific processors

 



 

 

Hi Roger,

Good to hear from you.

Yes, non-associativity is a real problem, but I think most of the practical issues are local to a processor. In practice, if we are doing a distributed "reduce", it is quite likely that the MPI code in the application will impose a deterministic ordering. 
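
As a minimal sketch of what a deterministic ordering buys, the following simulates a reduce without using MPI at all. The (rank, value) pairs and both function names are hypothetical; the point is only that combining partial results in a fixed rank order gives one answer however the messages arrive, while combining them in arrival order does not.

```python
from itertools import permutations

def reduce_in_rank_order(partials):
    """Sum (rank, value) pairs in ascending rank order, ignoring arrival order."""
    total = 0.0
    for _rank, value in sorted(partials):
        total += value
    return total

def reduce_in_arrival_order(partials):
    """Sum (rank, value) pairs in whatever order they happen to arrive."""
    total = 0.0
    for _rank, value in partials:
        total += value
    return total

# Hypothetical partial results from four ranks, chosen to provoke rounding.
partials = [(0, 1e16), (1, 1.0), (2, -1e16), (3, 1.0)]

# Try every possible arrival order.
rank_order_results = {reduce_in_rank_order(list(p)) for p in permutations(partials)}
arrival_order_results = {reduce_in_arrival_order(p) for p in permutations(partials)}

print(rank_order_results)     # a single value
print(arrival_order_results)  # more than one value
```

A fixed order makes the result repeatable, not more accurate: the single value it produces still carries rounding error.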

On the other hand, we can encourage the compiler to reorder FP operations with, for example, the GCC option -funsafe-math-optimizations. That may make bad things happen, but they will happen more quickly.

We might hope that the developers of scientific codes are full-on numerical analysts and know whether their particular calculations are stable in the presence of weakening of the IEEE 754 guarantees. In other words, that the final answer is within the error bounds given in the program's specification. Don't count on it. I fear what is more likely is that a few sample runs are made at double or quad precision and, if the answer does not change much, all is assumed to be OK.

My experience was even worse than this. We sometimes simply could not get bit-exact results out of the same code on machines from two different vendors. An obvious, and quite likely, reason is that different stack layouts caused improperly initialised local variables to take on different values. But nothing showed up at compile time, and the results were not "too far" off. Scientific production code is more characterised by urgency than by formal correctness, and checks are likely to be based on reasonable consistency with previous versions.

Remember the old broken FORTRAN scalar product (of three-dimensional vectors) example that may still be lurking:

      DO 10 I=1. 3
      A(I) = B(I) * C (I)
10    CONTINUE

Even now, it's not obvious how we could find the problem by asking a model checker to verify that the code is SO(3) invariant. How about m3m? I can actually see how to check that but, in an odd corner of the code, the error might go unnoticed for years.

 

While we are descending into obscurity, can I draw attention to some work by my ex-colleague Pawel Sobocinski on Graphical Linear Algebra: https://graphicallinearalgebra.net/. See episode 26. It turns out that, if you give up a little associativity and a little commutativity, you can make rational arithmetic complete. You just need three extra numbers:

∞       

That's far more parsimonious than IEEE 754. The intuition is that the first is infinity; the others correspond to "no value" and "any value".

 

NB: To guarantee O(3), the obvious approach would be to show that all the coordinate arithmetic can be rewritten in terms of the invariant tensors δij and εijk. The code above can't be rewritten in that way, at least not if you have good eyesight.
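
Short of a proof, the invariance can at least be spot-checked numerically. The sketch below is an illustration only, not a model checker: it tries a single rotation about the z axis (an arbitrary choice, as are the test vectors and the helper names) and compares a true scalar product with the element-wise product that the loop body above computes.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3-vector about the z axis by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def dot(b, c):
    """True scalar product: delta_ij * b_i * c_j."""
    return sum(bi * ci for bi, ci in zip(b, c))

def elementwise(b, c):
    """Component-by-component product, A(I) = B(I) * C(I)."""
    return tuple(bi * ci for bi, ci in zip(b, c))

b = (1.0, 2.0, 3.0)
c = (4.0, -5.0, 6.0)
angle = 0.7

# The scalar product of the rotated vectors should equal the original one.
dot_is_invariant = math.isclose(
    dot(rotate_z(b, angle), rotate_z(c, angle)), dot(b, c)
)

# If the element-wise product were a proper vector, rotating the inputs
# would give the same result as rotating the output.
rotate_then_multiply = elementwise(rotate_z(b, angle), rotate_z(c, angle))
multiply_then_rotate = rotate_z(elementwise(b, c), angle)
elementwise_is_covariant = all(
    math.isclose(p, q, abs_tol=1e-12)
    for p, q in zip(rotate_then_multiply, multiply_then_rotate)
)

print(dot_is_invariant, elementwise_is_covariant)
```

A check like this only catches the error if the test exercises the odd corner of the code where it lives.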

 

Best wishes,
Denis

 

On 25/11/2020 10:42, Roger Shepherd wrote:

Denis, 

 

Regarding bit-exactness.



On 25 Nov 2020, at 09:28, Denis A Nicole <dan@xxxxxxxxxxxxxxx> wrote:

3. In most systems, the real load is taken by the floating point units.  The IEEE standard is important here for several reasons.

  1. Floating point arithmetic is famously not associative. This heavily restricts the optimisations which can be performed while retaining bit-for-bit identical results. You either accept that the answers can change, write your code very carefully to pre-implement the optimisations, or go slow.

Non-associativity is a problem. It’s why these two pieces of code can give different results for the value sum

 

sum = 0.0
for i = 0 for x.size
    sum = sum + x[i]

 

and

 

sumFirstPart = 0.0
for i = 0 for x.size/2
    sumFirstPart = sumFirstPart + x[i]

sumSecondPart = 0.0
for i = x.size/2 for (x.size - x.size/2)
    sumSecondPart = sumSecondPart + x[i]

sum = sumFirstPart + sumSecondPart

 

The second code segment shows exactly what you want to do to perform the two partial sums in parallel. I believe that in some real-world systems the decisions about parallelisation get made at run-time, depending on what the computer is doing at the time, and so different runs on identical data give rise to different results.
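
The two orderings can be tried out directly. This is a minimal Python sketch, assuming IEEE 754 double precision; the four input values are made up to provoke rounding, and math.fsum is used only as a correctly rounded reference.

```python
import math

def sequential_sum(x):
    """Add the elements one after another, left to right."""
    total = 0.0
    for value in x:
        total = total + value
    return total

def split_sum(x):
    """Sum each half separately, then add the two partial sums."""
    half = len(x) // 2
    first_part = sequential_sum(x[:half])
    second_part = sequential_sum(x[half:])
    return first_part + second_part

# At magnitude 1e16 adjacent doubles are 2 apart, so adding 1.0 is lost.
x = [1e16, 1.0, -1e16, 1.0]

sequential = sequential_sum(x)  # 1.0
split = split_sum(x)            # 0.0
exact = math.fsum(x)            # 2.0, the correctly rounded true sum

print(sequential, split, exact)
```

Here neither ordering gives the true sum, and the two orderings disagree with each other.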

 

 

4. Getting bit-for-bit matching answers from consecutive runs is really difficult. Obviously, we need to seed our PRNGs consistently, but there are all sorts of internal code and optimisation issues that break things. This leads to real difficulty in verifying benchmark tests. Overlaid on this are sometimes genuine instabilities in important scientific codes. For example, global ocean models can be very hard to "spin up"; you need exactly the right initial conditions or the circulation never converges to that of the planet we know. This may not even be a problem in the models; perhaps the real equations of motion have multiple fixed points? There are similar difficulties in molecular dynamics around hydrogen bonding. Sadly, that is the case we care about most; it covers protein folding in hydrated systems.
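
Consistent seeding is the easy part, and can be sketched as below. The function name run_simulation and the seed values are hypothetical. Each run owns its generator rather than sharing global state, so the sequence drawn depends only on the seed.

```python
import random

def run_simulation(seed, steps=5):
    """Stand-in for a stochastic run: draw `steps` numbers from a private PRNG."""
    rng = random.Random(seed)  # private generator, no shared global state
    return [rng.random() for _ in range(steps)]

first_run = run_simulation(seed=12345)
second_run = run_simulation(seed=12345)
other_seed = run_simulation(seed=54321)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

This is necessary but not sufficient: the code and optimisation issues mentioned above still apply to whatever is computed from those numbers.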

 

The non-associativity of f.p. arithmetic is the cause of many problems. Is the repeatability problem you mention due to effects other than this?

 

Roger
