
Re: Go - deadlocked processes causing memory leaks



Roger,

On Mar 30, 2015, at 8:23 AM, Roger Shepherd <rog@xxxxxxxx> wrote:

> Larry
> 
>> 
>> I’ve always thought InputOrFail and brethren were very important,
> 
> I think they are
> 
>> and that the theory needed to be expanded to try to embrace those cases,
> 
> but I think this expansion is tricky.
> 
>> for reasons like those in this thread. In my book Crawl-Space Computing, I worked on one case (orchard sensors) that had failing connections, and found I had to carry it a little farther (p 133):
>> 
>> PRI ALT
>>  bool1 & c1 ? mess1
>>    -- code
>>  -- more channel guards if needed
>>  clock ? AFTER timeout
>>    -- code, e.g. timedout := TRUE
>>  FAILURE
>>    -- code, e.g. notdone := FALSE
>> 
>> where c1 ? mess1 is really an InputOrFail and branches to FAILURE if aborted. The reason it works is that once an inputting link channel guard is selected, and before its communication is done, its process’s address remains in its channel word and can be recovered. If the first, “unsolicited” byte has not been sent, or has not finished coming, then the channel’s branch of the ALT is not ready, and the timer branch wins. So this makes the ALT bulletproof.
> 
> I think the c1 ? mess1 has to be an InputOrFail. The problem is that if the first byte arrives correctly, but there is a problem before the end of the message, an input may hang. The presence of the first byte will trigger the selection mechanism of the ALT and an input (c1 ? mess1) will be performed; it is this input which can fail, and once the input instruction has been executed the rest of the ALT has already been cleaned up, and hence there is no pending timeout.

You have basically restated what I said. I’m sorry if I was not clear! Everything depends on the ALT being able to conclude even if the c1 input hangs. It can hang in one of two ways: during the unsolicited byte, or after the unsolicited byte. (The code executed between the full reception of the unsolicited byte winning the ALT and the c1 ? mess1, which is really an InputOrFail, is completely local to the ALTing process, and therefore this communication cannot hang there.) So both cases are covered.

> 
>> 
>> Larry Dickson
> 
> 
> I have two concerns. 
> 
> The first, which I’ve mentioned before, is that a “failing” communication undermines the programming model. Dealing with failing communication is a bit like dealing with failing variables. How might a variable fail? As far as I can see, the failure of a variable can only be the failure to yield the correct (last written) value when read; this is like a communication occurring but the wrong value being received. I think we have to assume our variables work (although perhaps accessing the value of a failed variable could be mapped into STOP?). In a communication we can take steps to ensure that the data passed is correct (error correction) although there are implementation issues (see next point). This leaves us with the nasty property of the failing communication that there is a breakdown of synchronisation. Whilst I think InputOrFail etc. provide a fairly neat way of localising failure, it is by no means perfect. For example, consider
> 
>    CHAN OF [5]BYTE c :
>    PAR
>         InputOrFail.t(c, buffer, t, failedInput)
>         OutputOrFail.t(c, "Hello", t, failedOutput)
> 
> There is no need for failedInput and failedOutput to have the same value - which seems a little strange. Perhaps something can be done to make these operations semantically clean, and to make them compose properly, and...

I think the practical key is non-hanging, i.e. the process concludes. A process that concludes due to a failure of communication is a destroyed process (it cannot fulfill its programming), so I suppose it is a STOP. In the case you just described, one of the PAR members does a STOP, and the other goes on until it either terminates or is also STOPped by a later attempt to communicate with the one which STOPped earlier. This sort of thing was envisaged in the old occam/Transputer design, though I can’t put my finger on the reference at the moment; the alternative (on a B008) was that any process STOPping would raise the error flag, kill all processes, and trigger immediate post-mortem debugging.

> 
> Which brings me to my second point. Implementation. In occam we chose not to have output guards because of the cost of implementation (notwithstanding whether we could have come up with a “correct” implementation). InputOrFail etc are significantly more expensive than ? and !. I have a concern that any scheme to make the semantics of InputOrFail etc “nice” will carry a heavy penalty. 

My scheme is completely trivial (just a new keyword, FAILURE, and an assembly code branch following the InputOrFail abort). Output ALTs were avoided because of the “after you” problem if an ALTing input tries to communicate with an ALTing output, and also because they are not needed, since you can always send a request byte the other way and use an input ALT. Ultimately, it’s because somebody has to make an unsolicited commitment, and the Transputer just went with the first output byte (i.e. it’s send - ACK and not REQ - receive).
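
The request-byte idiom described above can be sketched in Go as follows. This is an illustration only: Go's select does accept send cases, so Go itself does not need the workaround, and the names consumer and distribute are hypothetical. The producer never chooses between outputs; it chooses between request inputs and then makes a committed output to whichever consumer asked.

```go
package main

import "fmt"

// consumer announces readiness with a request, then does a plain input.
func consumer(req chan<- struct{}, in <-chan int, done chan<- []int) {
	var got []int
	for {
		req <- struct{}{} // the unsolicited commitment goes consumer -> producer
		v, ok := <-in
		if !ok { // data channel closed: no more items
			done <- got
			return
		}
		got = append(got, v)
	}
}

// distribute hands each item to whichever consumer's request is taken.
func distribute(items []int) (a, b []int) {
	reqA, reqB := make(chan struct{}), make(chan struct{})
	outA, outB := make(chan int), make(chan int)
	doneA, doneB := make(chan []int), make(chan []int)
	go consumer(reqA, outA, doneA)
	go consumer(reqB, outB, doneB)
	for _, v := range items {
		select { // input-only choice over the request channels
		case <-reqA:
			outA <- v // committed output: consumer A is known to be waiting
		case <-reqB:
			outB <- v
		}
	}
	// Answer one last request from each consumer by closing its data channel.
	<-reqA
	close(outA)
	<-reqB
	close(outB)
	return <-doneA, <-doneB
}

func main() {
	a, b := distribute([]int{1, 2, 3, 4, 5})
	fmt.Println(len(a) + len(b)) // prints 5: every item is delivered exactly once
}
```

Which consumer receives which item is not determined, but no output is ever attempted without a matching request already taken, so there is no “after you” standoff.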

> 
> So what can be done? I suspect the way to approach this is to use a language which has a higher level of abstraction and allows interaction between concurrent entities to be treated at a higher level. For example, if the client-server relationship were captured in such a language, then (i) more efficient implementation of clients-and-servers might be possible than can be achieved using occam, and (ii) it may be able to deal with “unreliable clients” (as seen by a server) and “unreliable servers” (as seen by clients) in a neat manner. I think the trick with this type of endeavour is to make choices which allow real problems to be solved neatly, while not providing too much generality.
> 
> Roger

I’m leery of abstraction because it always commits to a preferred model and hides the underlying machinery, which means when you need a model different from the preferred one you are fighting hidden machinery. The occam-like machinery gets you to the place where you sanely discard the process. I think the notion of a STOPped process is sufficient for practical use, as long as you deal without kidding yourself with all the things that can STOP it, and have redundant code branches (I mean different-in-kind) that can deal with the failure. People in practical situations deal with this all the time (your car stalls, you call a taxi). Think about outer space, where a cosmic ray comes along and clobbers part of a chip.

Larry

> 
> 
> 
> 
> 
> 
> --
> Roger Shepherd
> rog@xxxxxxxx
>