Re: Scientific processors
- To: Roger Shepherd <rog@xxxxxxxx>
- Subject: Re: Scientific processors
- From: Denis A Nicole <dan@xxxxxxxxxxxxxxx>
- Date: Wed, 25 Nov 2020 12:24:16 +0000
- Archived-at: <https://lists.kent.ac.uk/sympa/arcsearch_id/occam-com/2020-11/9a91f472-bae2-fd2c-bd53-9018e19dd706%40ecs.soton.ac.uk>
- Cc: Larry Dickson <tjoccam@xxxxxxxxxxx>, Øyvind Teig <oyvind.teig@xxxxxxxxxxx>, Tony Gore <tony@xxxxxxxxxxxx>, Ruth Ivimey-Cook <ruth@xxxxxxxxxx>, occam-com <occam-com@xxxxxxxxxx>, Uwe Mielke <uwe.mielke@xxxxxxxxxxx>, David May <David.May@xxxxxxxxxxxxx>, Michael Bruestle <michael_bruestle@xxxxxxxxx>, Claus Peter Meder <claus.meder@xxxxxxxxxxxxxx>
- Delivery-date: Wed, 25 Nov 2020 12:24:41 +0000
- Envelope-to: ats@xxxxxxxxx
- In-reply-to: <D14FBD38-1CA4-4CCD-A4CD-E8AD11DF3C45@rcjd.net>
- References: <C873035A-2E4D-4079-A7BA-D02635B6558E@tjoccam.com> <EF490EDC-611F-44BD-879B-95923FB47496@teigfam.net> <6364CB26-883F-4B91-88B7-997DDCC49760@teigfam.net> <6A533325-949F-425A-9A3B-0400B3CE4F7D@tjoccam.com> <VI1PR05MB5903F0DB8738AF40CABF60D3E0FF0@VI1PR05MB5903.eurprd05.prod.outlook.com> <C9038418-4AD5-4DB3-A7AC-2C9799242792@tjoccam.com> <6E379737-0600-45AF-BFC5-073A3526C2C2@rcjd.net> <F881DCB9-A9EA-468F-88E3-CDA3CB457FBF@tjoccam.com> <DBE32399-FF55-4F22-A847-A37F2E5DF3C1@rcjd.net> <2222a3c3-4e6f-cfee-e5ce-24c65b1ee06d@ivimey.org> <05D297F6-7933-4349-876B-573D7A26D1DD@tjoccam.com> <VI1PR05MB590305A50446979450C1E584E0FB0@VI1PR05MB5903.eurprd05.prod.outlook.com> <82C57044-03D3-4894-A0F5-7A6A8FACE183@teigfam.net> <15371949-5B0F-47DB-BD26-DDE91EE9ED85@tjoccam.com> <22F803FF-B74C-4CA0-B0F5-299C79A95DE3@rcjd.net> <A8FEFDFC-CFD6-4959-864D-572C371CCA68@tjoccam.com> <ecae4c9a-a974-e22f-0472-da0bac10f948@ecs.soton.ac.uk> <D14FBD38-1CA4-4CCD-A4CD-E8AD11DF3C45@rcjd.net>
- Reply-to: Denis A Nicole <dan@xxxxxxxxxxxxxxx>
- Sender: occam-com-request@xxxxxxxxxx
- User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0
Hi Roger,
Good to hear from you.
Yes, non-associativity is a real problem, but I think most of the
practical issues are local to a processor. In practice, if we are
doing a distributed "reduce", it is quite likely that the MPI code
in the application will impose a deterministic ordering.
On the other hand, we can encourage the compiler to reorder FP
operations with, for example, the GCC option
-funsafe-math-optimizations. That may make bad things happen, but
they will happen more quickly.
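A small illustration of the regrouping at issue (not part of the original message): a minimal Python sketch, assuming IEEE 754 double precision, in which the same three values are summed with two different bracketings. The variable names are made up for the example.

```python
# Three ordinary decimal literals; none is exactly representable in binary.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # the order the source text asks for
regrouped = a + (b + c)       # an order a reassociating optimiser might pick

# The two results differ in the last bit.
print(left_to_right, regrouped, left_to_right == regrouped)
```

Neither answer is "wrong"; each is a correctly rounded evaluation of a different expression tree, which is why reassociation changes bit-for-bit results.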
We might hope that the developers of scientific codes are full-on
numerical analysts and know whether their particular
calculations are stable in the presence of weakening of the IEEE
754 guarantees. In other words, that the final answer is within
the error bounds given in the program's specification. Don't count
on it. I fear what is more likely is that a few sample runs are
made at double or quad precision and, if the answer does not
change much, all is assumed to be OK.
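One inexpensive check in the same spirit, sketched in Python and not taken from the original message: compare a naive accumulation against the standard library's `math.fsum`, which returns a correctly rounded sum. The helper names and the tolerance argument are hypothetical, and passing such a check is no substitute for a proper error analysis.

```python
import math

def naive_sum(values):
    """Accumulate left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total

def within_bound(values, rel_tol):
    """True if the naive sum is within rel_tol (relative) of math.fsum.

    rel_tol stands in for the error bound a program's specification
    would have to supply; it is an assumption of this sketch.
    """
    reference = math.fsum(values)
    return abs(naive_sum(values) - reference) <= rel_tol * abs(reference)

data = [0.1] * 10
print(naive_sum(data), math.fsum(data), within_bound(data, 1e-12))
```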
My experience was even worse than this. We sometimes simply could
not get bit-exact results out of the same code on machines
from two different vendors. An obvious, and quite likely, reason
is that different stack layouts caused improperly initialised
local variables to take on different values. But nothing showed up
at compile time, and the results were not "too far" off.
Scientific production code is more characterised by urgency than
by formal correctness, and checks are likely to be based on
reasonable consistency with previous versions.
Remember the old broken FORTRAN scalar product (of
three-dimensional vectors) example that may still be lurking:
      DO 10 I=1. 3
      A(I) = B(I) * C (I)
   10 CONTINUE
Even now, it's not obvious how we could
find the problem by asking a model checker to verify that the code
is SO(3) invariant. How about m3m? I can actually
see how to check that but, in an odd corner of the code, the error
might go unnoticed for years.
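Short of a model checker, a randomised test can at least probe rotation invariance. The following Python sketch is not from the original message and is much weaker than verification: it applies random rotations, built from a z-axis and an x-axis rotation, to both arguments and checks that the result is unchanged. `broken_product` is a hypothetical stand-in for a routine in which only one component contributes; it does not model the FORTRAN above exactly. All function names and the tolerance are assumptions.

```python
import math
import random

def rotate_z(v, angle):
    """Rotate the 3-vector v about the z axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]

def rotate_x(v, angle):
    """Rotate the 3-vector v about the x axis by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [v[0], c * v[1] - s * v[2], s * v[1] + c * v[2]]

def scalar_product(b, c):
    """The intended computation: sum over components of b[i] * c[i]."""
    return sum(bi * ci for bi, ci in zip(b, c))

def broken_product(b, c):
    """Hypothetical bug: only the first component is used."""
    return b[0] * c[0]

def looks_rotation_invariant(f, trials=100, tol=1e-9, seed=0):
    """Return False as soon as one random rotation changes f(b, c)."""
    rng = random.Random(seed)
    for _ in range(trials):
        b = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        c = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        az = rng.uniform(0.0, 2.0 * math.pi)
        ax = rng.uniform(0.0, 2.0 * math.pi)
        rb = rotate_x(rotate_z(b, az), ax)
        rc = rotate_x(rotate_z(c, az), ax)
        if abs(f(rb, rc) - f(b, c)) > tol:
            return False
    return True

print(looks_rotation_invariant(scalar_product))
print(looks_rotation_invariant(broken_product))
```

A test like this only helps if someone thinks to run it on the odd corner of the code in question, which is the difficulty raised above.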
While we are descending into obscurity, can I draw attention to
some work by my ex-colleague Pawel Sobocinski on Graphical Linear
Algebra: https://graphicallinearalgebra.net/.
See episode 26. It turns out that, if you give up a little
associativity and a little commutativity, you can make rational
arithmetic complete. You just need three extra numbers:
∞ ⊤ ⊥
That's far more parsimonious than IEEE 754. The intuition is that
the first is infinity; the others
correspond to "no value" and "any value".
NB: To guarantee O(3), the obvious approach would be to show that
all the coordinate arithmetic can be rewritten in terms of the
invariant tensors δ_ij and ε_ijk. The code above can't be
rewritten in that way, at least not if you have good eyesight.
Best wishes,
Denis
On 25/11/2020 10:42, Roger Shepherd wrote:
Denis,
Regarding bit-exactness.
3. In most systems, the real load is taken
by the floating point
units. The IEEE standard is important here
for several reasons.
- Floating point arithmetic is famously
not associative. This heavily restricts the
optimisations which can be performed while
retaining bit-for-bit identical results. You
either accept that the answers can change, write
your code very carefully to pre-implement the
optimisations, or go slow.
Non-associativity is a problem. It’s why these two
pieces of code can give different results for the value
sum:
    for i = 0 for x.size
      sum = sum + x[i]
and
    for i = 0 for x.size/2
      sumFirstPart = sumFirstPart + x[i]
    for i = x.size/2 for (x.size - x.size/2)
      sumSecondPart = sumSecondPart + x[i]
    sum = sumFirstPart + sumSecondPart
The second code segment shows exactly what you want
to do to perform the two partial sums in parallel. I
believe that in some real world systems the decisions
about parallelisation get made at run-time, depending
on what the computer is doing at the time, and so
different runs on identical data give rise to
different results.
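The effect is easy to reproduce. The Python sketch below is not from the original message; it sums the same list once sequentially and once as two halves that are then combined. The function names and the input values are made up, and the values are deliberately extreme so that the rounding is easy to follow by hand.

```python
def sequential_sum(x):
    """Sum left to right, as a single loop would."""
    total = 0.0
    for v in x:
        total += v
    return total

def two_part_sum(x):
    """Sum each half separately, then add the two partial sums."""
    half = len(x) // 2
    return sequential_sum(x[:half]) + sequential_sum(x[half:])

# 1.0 is far below the rounding granularity of 1e100, so it is lost
# whenever it is added to a partial sum of that magnitude.
x = [1e100, 1.0, -1e100, 1.0]

print(sequential_sum(x), two_part_sum(x))
```

In the sequential order the large terms cancel before the final 1.0 is added, so it survives; in the two-part order each 1.0 is absorbed into a large partial sum and both are lost.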
4. Getting bit-for-bit matching answers from
consecutive runs is really difficult. Obviously, we
need to seed our PRNGs consistently, but there are all
sorts of internal code and optimisation issues that
break things. This leads to real difficulty in
verifying benchmark tests. Overlaid on this are
sometimes genuine instabilities in important
scientific codes. For example, global ocean models can
be very hard to "spin up"; you need exactly the right
initial conditions or the circulation never converges
to that of the planet we know. This may not even be a
problem in the models; perhaps the real equations of
motion have multiple fixed points? There are similar
difficulties in molecular dynamics around
hydrogen bonding. Sadly, that is the case we care
about most; it covers protein folding in hydrated
systems.
The non-associativity of f.p. arithmetic is the cause
of many problems. Is the repeatability problem you mention
due to effects other than this?
Roger