[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Something for the weekend ...



Dear All,


WHY DO STUDENTS BREAK THINGS?
=============================

A final-year student, working on a project with our CSP-for-Java (JCSP)
library, broke it a few weeks ago :-( ...

The good news is ... we fixed it ;-) ... but it was agonisingly hard to do.
Or have we really fixed it?  After all, we thought it was OK before.

It makes an interesting case study, so I'm setting it as a challenge for
your idle hours.

The student (Neil Fuller) came up with a deadlock in his application that
he claimed was not his fault.  He was right.  Distilling his system down
reduced it to a simple stress-test on the JCSP Alternative (i.e. ALTing)
that deadlocked.  I went back to the original Java algorithms for the ALT:

  http://www.hensa.ac.uk/parallel/groups/wotug/java/discussion/4.html

This was published in October, 1996, following up discussions arising from
the Java Threads Workshop the month before.  Mapping the stress-test to use
this code (i.e. just using Java Threads directly - no JCSP packages), it
also failed :-(

The stress-test is simple: two threads (A and B) bombard two channels with
ints in a tight loop (the Channel in the above web page just carried ints).
There is a receiving thread (R) that `fair'-ALTs on the two channels, also
in a tight loop:

     ______________         ______________         ______________
    |              |       |              |       |              |
    |              |   a   |              |   b   |              |
    |      A       |--->---|      R       |---<---|      B       |
    |              |       |              |       |              |
    |              |       |              |       |              |
    ----------------       ----------------       ----------------

The select method in the Alternative class (from the above web page) is
supposed to be a PRI ALT.  To get a `fair'-ALT, unroll the body of the
loop in R so that it does two selects (PRI ALTs): the first gives priority
to channel `a' and the second to channel `b'.

What should *not* happen is that the sytem deadlocks.  Well, it runs for
a while ... sometimes several minutes ... and then it deadlocks ...


WHY DIDN'T WE STRESS TEST IT ORIGINALLY?
========================================

I can't remember, but I think we must have done.  But in 1996, we weren't
using a JIT so a stress test wouldn't really have stressed it and the race
hazard wouldn't have shown up.  The JCSP/CTJ libraries then got built and
are certainly used with JITs - but till now, none of our JCSP applications
really stressed it.


WHAT WENT WRONG?
================

The web page was published long ago (2nd. October, 1996).  The algorithm
tries to be a Java rendering of the algorithm implemented by the transputer.
It can't be exactly the same, because of certain aspects of Java, but was
as close to it as could be made (or, at least, that's what I tried to do).

Prior to appearing on that web page, it was posted to java-threads and
occam-com.  The algorithm has been presented at seminars around the world
and in private conversations with all sorts of clever people.  I never had
any problem in explaining how it was a watertight implementation of the ALT.
The algorithm was incorporated into the JCSP library and has worked
successfully for a couple of years ... until a couple of weeks back!
The algorithm was also (I think) incorporated into Twente's CJT library,
where it's also been used without apparent failure (Andy/Gerald: look out!).

But the algorithm is wrong.  It has a race hazard.  Stress it and it will
deadlock.  Oh, dear!

The algorithm involves two monitors (classes): Alternative and Channel.
The Alternative has synchronized methods: select (public) and schedule
(of package visibility).  The Channel has four methods: read and write
(both public) and enable and disable (both package visibility).  The
read, write and enable methods are synchronized - the disable one is
not synchronized.  The reason given on the web page for not synchronising
the disable is valid (it would *quickly* deadlock otherwise).

There are several waits and notifies around the code.  The two classes
interact with each other in all sorts of interesting ways.

At WoTUG-21, I gave an informal presentation of a Hoare monitor alogorithm
that had just 4 (synchronized) methods plus various waits and signals.
That was pretty hard to follow!  I think these Alternative/Channel
interactions may be a tad harder.


MORAL:
======

Programming with synchronized, wait and notify is to be avoided at all costs!
Above an almost trivial level of complexity, we just plain get them wrong!!
Even when the code is a key element implementing a major product and we
stare at it and talk about it for years, we still get it wrong.  What if
we were building something safty-critical (like air traffic control)?

The hope is that we've got it right now for our CSP primitives and we will
never be bothered by them again.  We can use the CSP stuff instead - which
is something about which we *can* reason and which remains reasonable with
increasing complexity of application.


SO WHAT WENT WRONG?
===================

I'm not going to tell you ;-)

At least, not until some (probably) late night session at WoTUG-22 at Keele
next month ... have *you* booked your place yet?!!

Meanwhile, it is an exercise for the reader.  Print out the web page.  Stare
at the Alternative and Channel classes.  Spot the race hazard!

Hint: without formal methods, I bet you won't spot it!  Instead, build the
stress-test outlined above.  Run it and watch it deadlock.  Get your JVM
to dump a stacktrace for all its (3) threads (CTRL-\ for JDK on Solaris
or CTRL-Break on Windows may do the trick ...).  Look at the stack traces.
The deadlock is obvious, but how did those threads get into those states?
The Alternative/Channel methods were not supposed to let that happen!
[Note: this took us an hour or so.]

Once you've found the deadlock and worked out the race hazard that lets it
happen, fix the algorithm so that it doesn't.  [Note: this took us about 5
days ...]


HOW DO WE KNOW WE'VE GOT IT RIGHT NOW?
======================================

We don't.  I'll publish the new algorithm at WoTUG-22.  Please would someone
then prove it correct!!!  Or prove it wrong, correct it and prove that one
correct!

A problem lies in that there are no formal semantics for Java's synchronized
keyword, nor for its wait and notify methods.  They must have some semantics,
because we don't seem to have any problem informally explaining what they do.
Indeed, a problem with these things is that it's *easy* to explain what they
do.  Great, threads seem so simple in Java - at least, the primitives seem
simple enough.  It's when we use them in combination that we lose all sense
of what we are doing.

To formalise these things, build CSP models for them.  Get some consensus
that these models do reflect the informal Java semantics.  Then, we can
start proving things (with respect to the assumption that the CSP models
are good ones).  [I can see(ish) how to map sync and wait into CSP ...
the notify beats me though ... ???]

Then, to prove the revised Alternative/Channel implementations correct,
turn them back into CSP (assuming we have CSP models for sync/wait/notify).
We would now have an implementation in terms of CSP primitives.  Prove
that that implementation equals an ALT.


Have a good weekend,

Peter.