Good that you bring up TTP. It is pushed hard, almost religiously, by Prof. Kopetz as being THE solution
to real-time and safety. It has the merit that one can calculate the timings up front.
But the issue is that it has no graceful degradation: miss the deadline by
one microsecond and TTP says “fatal error”. Most systems can easily tolerate such
deadline misses if they happen sporadically. In the case of brakes, the
mechanical inertia is in the millisecond range, so a microsecond more or less will not make
a difference at all; in the worst case you get a bit less QoS. This means that a system that can tolerate a few deadline
misses could be safer than one designed under the assumption that you
can never miss any.

Eric

From: Mailing_List_Robot [mailto:sympa@xxxxxxxxxx] On Behalf Of Tony Gore

Hi Eric

Some years ago I reviewed some developments on time-triggered
architectures for anti-lock braking, in which each wheel acted independently. I took
the bus time interval and calculated how far a car would
travel on a high-speed bend on the autobahn at 160 km/h: it was around 1.5 m.
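As a rough check of that figure (a sketch with assumed numbers; the original post does not state the bus interval, so it is recovered here from the quoted speed and distance):

```python
# How long a bus time interval corresponds to 1.5 m of travel at 160 km/h?
# (Illustrative reconstruction; the actual bus parameters are not in the post.)
speed_kmh = 160.0
speed_ms = speed_kmh * 1000.0 / 3600.0   # ~44.4 m/s
distance_m = 1.5
interval_ms = distance_m / speed_ms * 1000.0
print(f"{speed_ms:.1f} m/s, so {distance_m} m corresponds to a ~{interval_ms:.0f} ms bus interval")
```

In other words, a worst-case skew of roughly one bus interval between wheels translates directly into metres of travel at motorway speed.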
I then considered what would happen on black ice if the brakes were applied to
one wheel, or to opposite wheels, separated by this time interval, and decided that I would not want
to be in that car, as this would probably start an uncontrollable spin.

This is one example of the difference between the continuous
domain and the discrete domain. Default actions can be quite tricky to define.

When I worked at
Motorola, we had the “Motocar”. This was a Lancia Delta where we took
out all the wiring (1.5 km) and put a power bus and a fibre-optic control bus
round the car. There were then about 30 black boxes, e.g. one at each light
cluster. The question of default actions, e.g. when communications failed, became
interesting. Lights are easy: you put them on, as it is no real problem to
have lights on during the day, but hard and illegal to drive without lights at
night. Same with the fuel pump. The biggest question is what to do with the
windscreen wipers: default on or off?

By the way, it was fun driving such a car in failure mode, as
there were no turn signals, the UK highway code never gave hand signals for
a left-hand-drive car turning right, and we could not open the electric windows anyway!

Tony Gore
email tony@xxxxxxxxxxxx

From: Eric Verhulst [mailto:eric.verhulst@xxxxxxxxxxxxxxxxxxxxxx]

Hi Ian

I might have a text (or at least the basis) for your last book.
It is an email I sent to the Formal Methods community after FM2009, after picking up a
leaflet on a survey of formal methods.

I also did some investigation into car safety, and the issues are
much wider than adding the right IF statements. The root cause is incomplete
requirements analysis to start with, but mainly that the automotive people have
not yet understood the difference between the continuous and the discrete
domain. Mechanical parts are in the first domain and have a natural “graceful
degradation” property. Electronics and software are in the second domain and
can stop working in one clock cycle (say 10 nanoseconds). Even with a very low
probability of failure at the hardware level, after some time in the human
domain (e.g. driving a car for 5 years), it is a certainty that it will happen.
Even a dual-core controller (like the one TI has with ARM cores) will not cover
common-mode failures. Hence our controller will be heterogeneous.

My conclusion is that subsystems like traction control,
braking, etc. need to be SIL4 => fault tolerance through triplication (pedal,
sensors, CPU, SW, …) and voting. If this fails you are at SIL3 (fail-safe
mode), which is now the “standard” practice => often it means they put the
engine at high throttle. Still very risky at high speed on a highway. Humans
need a bit more than 20 nanoseconds to take corrective action.

Best regards, Eric

See below the mail I sent (to John Fitzgerald).

Dear FM authors,

I picked up the leaflet on the FM report at FM2009, read
your excellent report, and while I contributed to the survey, I believe
that the kind of experience we have with using FM is a bit missing from the
report.

In 2005 we started a project (with support from the Flemish
government organization IWT) to develop a distributed RTOS (OpenComRTOS) from scratch,
but using FM from the beginning (this was under the banner of the Open License
Society). The purpose was twofold:

1. To learn the practical use of formal methods
(we had no prior experience, just awareness, mostly of CSP).

2. To develop a “correct” (read: error-free and clean)
networked RTOS.

There was prior art. Virtuoso was a similar RTOS, developed by
Eonic Systems in the 90s and sold to Wind River in 2001.

The team was composed of three people:

- Myself, mainly as project manager and RTOS architect.

- Gjalt de Jong, a PhD in electronics, well versed in
telecom, as software implementor.

- Jose Faria, an electro-mechanical engineer:
he was our formal methods engineer (no prior experience).

The FM work was supervised by Prof. Raymond Boute.

The first 6 months were mostly spent getting familiar, in a
rather informal way, with what FM were: looking at tools, reading some books and
papers, and finally selecting a tool.

We had originally planned to use SPIN (as Promela is
C-like), but Prof. Boute suggested using TLA+/TLC. While we were initially
less enthusiastic about the mathematical notation, in the end this might
have been beneficial because it raised the level of abstraction, especially in
the way of thinking about the problem.

The way of working was that only Jose worked full time
(although he left after one year). Gjalt and I worked in parallel at a
company managing embedded software development, but with the freedom to work on our
RTOS project when we wanted. We worked mainly by email (Jose
was at another location) and met face-to-face every 2 to 4 weeks, mainly
discussing at a whiteboard.

So what happened?

First of all, the original TLA models were useless. It
turned out that we were still thinking in terms of (C) implementation. For example,
when a waiting list was to be modeled, Jose initially wrote TLA code mimicking
a linked list with pointers, etc. The models then exploded very
fast and became rather complex, until we realized this and simply used a set.
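To illustrate the shift in abstraction (a sketch, not taken from the actual OpenComRTOS models; the names are invented): a waiting list modelled as a set needs no pointers at all.

```tla
VARIABLE waiting                    \* the set of tasks blocked on the service

Wait(t)   == /\ t \notin waiting    \* a task joins the waiting set
             /\ waiting' = waiting \cup {t}

Release   == /\ waiting /= {}       \* some waiting task is released
             /\ \E t \in waiting : waiting' = waiting \ {t}
```

TLC then explores the states of the set directly, instead of chasing an exploding linked-list encoding.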
Another example was the discovery that waiting lists and buffers were actually
the same thing, etc. So we started again and then quickly, as well as incrementally,
built up the models: not of the full RTOS (that is beyond the capabilities of the tools) but
of the core functionality, like scheduling, semaphores, events, FIFOs, priority
inheritance, etc. Initial models were just 2 pages, but after some 25
iterations they could grow to 20-30 pages. However, between iterations,
often only 30 minutes were needed: we ran the model until no errors were found,
then refined it, ran it again, etc. Running the models took minutes. I
must add that the interactions of the three people involved were very
stimulating and helped a lot, because each of us looked at the problem from a
different angle.

At some point, we discovered that most traditional RTOS
services are actually variations on a theme: basically each service has a
synchronisation point, a synchronisation predicate and a synchronisation
action. We called this generic entity a “hub”. Another novelty was that we had
consistently designed the RTOS as a packet-switching system (in a very
systematic way).

This process lasted about 12 months. Two weeks later we had
running code. This code was derived by hand from the models, which served as
the architectural design. Since then, we have found errors, but most of them were
related to the underlying “HAL”: either an OS like Windows or Linux, or a
specific embedded processor. Noteworthy is that in doing this we discovered two
serious “bugs” in Windows and Linux: in Windows the “CriticalSection” system
call crashed when stressed, and so did the Linux semaphore.

The major surprise, however, was the code size. Compared with
Virtuoso we have 5 to 10 times less code for more or less the same basic
functionality. This was even more surprising as Virtuoso had a very good
reputation for size and performance: it was hand-tweaked and assembler-optimized,
3 generations of the RTOS were developed, and about 150 person-years went into
it. It was also very hard to port. OpenComRTOS, on the other hand, was almost fully
written in ANSI-C (MISRA-checked), easily ported, and much more scalable as well
as safer. Code size figures: typically 5 to 10 Kbytes for a full multiprocessor (MP) version;
an optimized single-processor (SP) version has been compiled to less than 1 KB. The Virtuoso
RTOS typically started at 5 Kbytes (just task scheduling) and easily went up to
70-80 Kbytes (typically on a 32-bit machine).

Once the software was stable, new formal models in TLA were
developed from the implementation source code. This was quite straightforward.
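As an illustration of the “hub” idea described above (a minimal sketch with invented names, not the actual OpenComRTOS API; a real hub would also re-evaluate its waiting list after each state change):

```python
class Hub:
    """Generic synchronisation entity: a predicate decides whether a request
    synchronises now; an action updates the state when it does."""

    def __init__(self, predicate, action):
        self.state = {"count": 0}
        self.waiting = []            # requests whose predicate was not yet true
        self.predicate = predicate
        self.action = action

    def request(self, packet):
        if self.predicate(self.state, packet):
            self.action(self.state, packet)
            return "synchronised"
        self.waiting.append(packet)  # park the request at the hub
        return "waiting"

# A counting semaphore as one parameterisation of the hub:
def sem_predicate(state, packet):
    return packet["op"] == "signal" or state["count"] > 0

def sem_action(state, packet):
    state["count"] += 1 if packet["op"] == "signal" else -1

semaphore = Hub(sem_predicate, sem_action)
```

An event, a FIFO or a resource would then differ only in the predicate/action pair, which is what made the services “variations on a theme”.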
Lessons:

1. Formal methods are not that difficult, but they must be
used as a supporting tool from the beginning, when developing the architecture
and even when narrowing down the specifications.

2. One doesn’t need large teams and efforts to use them,
although dedication is a must.

3. The biggest gain is in the architecture.

4. Once software has been developed with FM,
verification (using model checking) is easy.

What we did not do is verify the C code using static
checkers or proving tools. This is probably more appropriate for numerical
code.

This brings me to a final conclusion. The current dominant attitude
in the FM community is still that FM are to be used to verify software. The
problem is that such software can still be badly designed and inefficient.
Hence the verification effort will also be much larger than needed. Formal methods
really shine when used as a tool and method supporting abstract thinking,
together with informal techniques, pseudo-formal techniques (like architectural
diagrams), etc.

I recently attended a presentation by MS on the Hypervisor
verification (60 000 lines of C). I was astonished to hear that each C
function is more or less manually proved, with the C code as the only input for the formal
engineer! I presume that if this Hypervisor had been developed using
FM from the start, it could have been a lot smaller and likely a lot cleaner
and more performant. Just imagine that all embedded software had been developed
with FM, resulting in code sizes that are maybe 20% of what we have now. The
world would have been different. It would also have meant a serious reduction
in power consumption. It is also my opinion that, with a well-trained team
using such a “formalized” methodology, projects don’t need to take more effort
than with the still dominant manual methods. Hence, it can really be a cost-reducing
and certainly a serious risk-reducing approach (the traditional methodology is
very prone to run-away effort after the project was officially finished).

Best regards, Eric Verhulst

PS

1. In the meantime we have started to work on the time-out
functionality. Here using TLA is problematic, so we switched to UPPAAL. We are
also starting to use Frama-C and B.

2. A paper was published at FM 2008 on this project.

From: Mailing_List_Robot
[mailto:sympa@xxxxxxxxxx] On Behalf Of Ian East

Hi Tony

Thanks for this. Very worrying, given I drove my
family around Florida recently in a Prius!

It’s interesting how simple basics can get drowned out.
I suppose it happened to NASA with Challenger.

One of the things that attracts me to rail, over road,
(forthcoming book The Cost of the Car) is the engineering heritage.
Safety engineering began with rail, and signalling makes a nice student
exercise in sequential system design and, more recently, in formal analysis.

I worry whether the same art is being applied to car design.
Of course, it is entirely absent in road design, which is
inherently lethal anyway, as the annual carnage shows. And, of course, it is absent in the training of most
‘software engineers’, as is engineering per se.

I too began with µprocessor programming in assembly
language, and believe as much has been lost as gained in the use of ‘standard’
HLLs. I have had unending difficulty adapting to universal hacking: the
lack of any real specification, design, or analysis, with software.

I’m trying to assemble a series of texts called Electronics,
Systems, and Programming (http://www.openchannelpublishing.com/esp.html) that
combine a more formal approach to software, and would welcome a proposal,
particularly wrt embedded concurrent/reactive systems. (You get £2.50
back per book, instead of 80p at most, and a lower selling price, hence more sales.
And yes, that is realistic, especially with e-books on the rise.)

The CPA community is still badly in need of books that
reflect its ideas and ideals. The right people are always too busy, I
guess.

Best
Ian

On 31 Mar 2010, at 03:04, Tony Gore wrote:

Hi Ian

See this from an engineer who used to work at Ford in around the same era
that I was working in automotive. The key phrase he uses is

Although not common over here in the UK because of our penchant
for stick shift (manual gearboxes), most cars in the US have automatic
gearboxes and cruise control. One of the standard signals to cruise control is
the brake signal to disengage it.

Tony Gore
email tony@xxxxxxxxxxxx

Ian East
Open Channel Publishing Ltd. (Reg. in England, Company Number 6818450)