RE: Occam-Tau - the natural successor to Occam-Pi

Hi David,

You surely must be joking.

You are the one who gave the world essentially an unlimited number of processes to program with. On the transputer, even a single assignment could be a PAR process. This was great and feasible because the context switch was light. It changed the way we thought about programming. (All I have done since then has evolved from there.) On modern CPUs, the grain size can't be that small because the context switch is a bit heavier, but the CPUs are now faster. Now you are pleading again for dedicated hardware threads (max 4 @ 100 MHz or max 8 @ 50 MHz on the XMOS core). Essentially, we are back to the superloop state machine. The XMOS core has a maximum of 8 threads, all sharing 64K of SRAM, and if one needs a simple peripheral like a UART, one or two threads are used up even if the CPU load is low. Modern microcontrollers use a small dedicated IP block for this, which is way cheaper and easier to use.

As to the number of threads (tasks is a better word), any typical real-time application has tens of them, talking to several (asynchronous) peripherals. Granted, most COTS RTOSes (typically POSIX-based) waste resources. Ours, taking the lessons from the CSP and transputer days, doesn't: it takes 5 KB, and we remove unused functionality at build time. It has been used on everything from chips with just 2 K of data RAM to systems with hundreds of processors (each having only 128 K of RAM).

Nevertheless, what you describe as moving towards a massive amount of resources on a single chip is happening. The move to multicore happens because it potentially reduces the system cost and Moore's law allows it. We call these devices RoCs (Rack On a Chip). All peripherals, some of them very complex like Gbit Ethernet switches, are integrated. And the issue is not that one uses such chips for handling a single event at a time, but that several applications share all these resources, the memory as well as the I/O. These chips even have more resources than one would typically use in a single application. For safety, this even means assuring full task partitioning in time and space. Prioritised scheduling (even with priority inheritance) is a must, even more so if the application needs to minimise power consumption (sleep mode when in the lowest-priority background loop). From pure real-time scheduling, applications are moving towards QoS-based scheduling. The other issue remains at all times that memory and I/O don't scale with the core density and hence with the processing power. A CPU that has to wait, or worse, must poll, gets no benefit from being clocked at a high rate. That's why the better chips allow on-chip cache memory to be configured as SRAM.

The contradiction, however, is that multicore is cheaper but that doesn't necessarily mean better. Two slower chips can be better (for power consumption, safety) than a single faster, multicore one. I think we agree on that. The issue is then one of transparent and scalable programming. Superloop state machines are very hard to distribute. This is where a concurrent programming model à la occam comes in, provided the hardware-level communication is decoupled from the interaction between the application components/tasks/processes. Hence, explicit communication (reflecting the hardware level) should not be in the language. Even in the transputer days, changing a single hardware link sometimes meant a lot of source code changes because there was no transparent routing between the chips. That only improved with the Virtual Channel routers. The same applies to a "predictable" static schedule. Distributing this over multiple cores means re-programming and re-analysing it very carefully. Priority is a runtime attribute. This is why global priorities in combination with concurrent tasks/processes are needed. Just remap and, if needed, reverify by running a profiling tool. No source code changes are needed because the functional behaviour is decoupled from the time behaviour. This is what concurrent programming is all about.

I grant that you make the argument that peripherals should have their own cores. It certainly makes sense when the main CPUs are heavily pipelined number crunchers. I made that argument already 15 years ago. But then the cost should be very low (adjust the size of the I/O CPU to the peripheral at hand) and the programming model should make that transparent. That is possible if the compiler is generated together with the CPU; often a CPU of a few thousand gates will do. In the absence of that, we have today integrated interrupt controllers. Not perfect, but they work. And by the way, the XMOS chip also has hardware-level interrupt support (which we use in our OpenComRTOS port).

Best regards,

Eric

From: David May [mailto:dave@xxxxxxxxxxxxx]
Sent: Thursday, October 04, 2012 1:04 AM
To: Eric Verhulst (OLS)
Cc: 'Occam Family'
Subject: Re: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

Eric,

You seem to be stuck in the past. If you need more processors or hardware threads, use them. It doesn't matter if you 'waste' some processing resource - most current systems waste a lot of resources - especially memory. And the most effective device to respond to an external event is an idle processor - or thread.

It's probably ok to regard an XMOS thread as a Turing/Von-Neumann machine. But XMOS is (I think) the only architecture that uses multiple real-time threads as i/o controllers. I don't think this was envisaged by Turing or Von Neumann. In this respect the XMOS cores do exactly what you are asking for in your paragraph below starting "The major point ..." - but they don't need "peripheral hardware" or a "schedulability analysis". There's no need for Interrupts - multiple threads are much more effective. You haven't commented on the ability to statically predict and guarantee response times on the XMOS cores - this goes well beyond anything that can be achieved using interrupts.

Why on earth would you want to run 100 real-time threads on just one tiny processor? Is there a realistic example of this? I think there are more interesting problems to solve.

David

On 3 Oct 2012, at 23:27, Eric Verhulst (OLS) wrote:

The test as described has some serious weaknesses:

1. The test is unrealistic. The ISR (or its equivalent on the XMOS chip) does nothing, except setting a bit to high. It should at least read a value from a peripheral register to be comparable with a real situation. In addition, it would be useful to stress test the CPU with some tasks that continuously exercise context switching (as this requires disabling interrupts).

2. FreeRTOS is free and that says it all. It even disables interrupts over every service call. As far as I can see, the RTOS was not even exercised or at least it is not clearly described.

3. This says little about the behavior when other interrupts are present as well. With one interrupt, one always gets the best time.

The major point, however, is that hard real-time is not about how fast a CPU reacts to an external event/interrupt. This is one aspect, but it is essentially determined by how the peripheral hardware is designed (e.g. how long does it hold the data?). Any reaction time is acceptable as long as it is less than the holding time and less than the interval between two interrupts. Hard real-time is about meeting multiple deadlines in a predictable manner, independently of how many tasks/processes are running. The first step is to carry out a schedulability analysis (typically RMA). Several papers and books since the 1970s describe this (Liu and Stankovic). That is not enough: one must also guarantee that the worst case is strictly bounded, because under realistic conditions the interrupt response time is a histogram, not a single value.

On the XMOS chip this works pretty well until one needs more processes than available hardware threads (8 at 100 MHz by using a kind of timeslicing) or in general hardware resources:

32-bit processor providing up to 700 MIPS (that is: all threads together)
Eight hardware threads and 32 channel ends
Ten timers and six clock blocks
Four XMOS Links
64KBytes SRAM and 8KBytes OTP memory (shared by all threads)

This is because, at heart, the XMOS CPU is still a von Neumann machine. How will it behave with 100 (software) threads, each having its own deadline to meet? How will that be guaranteed when there are asynchronous interrupts, so that the scheduling is not strictly periodic and predictable? And how do you share and protect the shared resources (e.g. the memory)?
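
As a rough illustration of the schedulability analysis (RMA) mentioned earlier in this message, the sketch below applies the classic utilisation-bound test for rate-monotonic scheduling. It assumes independent, periodic tasks whose deadlines equal their periods; the function names and the task figures are hypothetical and are not taken from any tool or measurement discussed in this thread. The test is sufficient but not necessary, so a task set that fails it may still be schedulable.

```python
def rma_utilization_bound(n):
    """Utilisation bound n * (2^(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)


def passes_rma_test(tasks):
    """tasks: list of (worst_case_execution_time, period) pairs.

    Returns True when total CPU utilisation is within the bound,
    which guarantees all deadlines are met under rate-monotonic
    priorities (assuming deadline == period, independent tasks).
    """
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= rma_utilization_bound(len(tasks))


# Hypothetical task sets, in arbitrary time units.
light_load = [(1, 4), (1, 5), (2, 10)]   # utilisation 0.65
heavy_load = [(2, 4), (2, 5), (2, 10)]   # utilisation 1.10

print(passes_rma_test(light_load))
print(passes_rma_test(heavy_load))
```

The bound shrinks towards ln 2 (about 0.69) as the number of tasks grows, which is one way to see why a large number of software threads on one CPU needs analysis rather than assumption.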

FYI, see http://www.altreonic.com/sites/default/files/Transparent%20Programming%20of%20ManyMulti%20Cores%20with%20OpenComRTOS.pdf for a more elaborate comparison.

This being said, we are still waiting for the ideal multicore chip. The final limitation is not so much how many gates one can squeeze on a chip, but how many I/O pins and 0WS memory can be made available in a package. In addition, today the market needs are shifting towards safety. Gates are almost free and finally they are being used to make the chips more reliable and capable of handling runtime faults. See e.g. http://www.ti.com/lsds/ti/arm/hercules_arm_cortex_r_safety_microcontrollers/arm_cortex_r4/rm4_arm_cortex_r4/overview.page for this evolution. The XMOS architecture has some potential in this domain, if enabled by the software support.

Best regards,

Eric

From: David May [mailto:dave@xxxxxxxxxxxxx]
Sent: Wednesday, October 03, 2012 8:10 PM
To: Larry Dickson
Cc: eric.verhulst@xxxxxxxxxxxxx; 'Eric Verhulst (OLS)'; 'Occam Family'
Subject: Re: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

Dear all,

I reluctantly added a single priority level to the transputer - at the time the economics seemed to support this. But I'm still not sure.

Today, I'm sure the economics doesn't support the complexity and overheads of priority schedulers, interrupts etc. The hardware event-handling on the XMOS cores is faster than anything an interrupt-based system can deliver - and much easier to design with - here's a link:

https://www.xmos.com/download/public/Benchmark-Methods-to-Analyze-Embedded-Processors-and-Systems%28X7638A%29.pdf?support=1

Even for a single low-end core, this approach will easily out-perform a conventional interrupt system.

With a lot of cores, as Larry has said, the event-handling will be at the edge of the system. The core of the system will be running parallel communication structures but - as I said in the presentation - these have to be designed to ensure efficient parallel communication flows.

Best wishes

David

On 3 Oct 2012, at 15:05, Larry Dickson wrote:

On Oct 3, 2012, at 1:25 AM, Eric Verhulst (OLS) <eric.verhulst@xxxxxxxxxxxxxxxxxxxxxx> wrote:

The bad news is that these chips are terribly complex to program. Silicon gates are supposed to be free, so the hardware has zillions of options. To understand the issue: the TI chips can route some 1000 interrupts to each core (using a 3 layer interrupt controller). Obviously, interrupt latency is still good, but relatively slower than on the much simpler ARM M3. The point is that a simple PRI ALT will not do the job. Two priorities are not sufficient. It worked more or less on the transputer because this chip had only one event pin. Impressive as the performance was at the time, it was still lacking for hard real-time when tens of processes were competing for the CPU. 32 to 256 priorities are needed to manage all the on-chip resources more or less satisfactorily.

So, it looks like the occam-restaurant is still at the end of the universe as Douglas Adams would have said. And looking at this discussion, the babelfish isn't helping very much.

Have fun,

Eric

Simple design answer: a chip (like the XMOS or Adapteva?) with a boundary of (say) 28 cores, each serving a single event/interrupt/link. Two-level PRI PAR independently on each core, so that each boundary core is absolutely responsive to its event. Extremely fast internal comms between cores, and hardware FIFOs between the boundary cores and the 36 internal cores, so that the reliably captured hard IO can handle slight delays before soft processing.

The reason I resist multiple hard priorities is that I think that solution is mostly a chimera. The top guy is OK, but number 2 and below swiftly become subject to occasional bad delays (when the interrupts happen to fall on top of each other). "[T]ens of processes … competing for the CPU" are only a real-time problem if they depend on lots of independent asynchronous stimuli (given a fast CPU). Multicore, which was not available in the time of the Transputer (except in the form of multiple Transputers, which was way expensive) lets you be responsive to all the stimuli independently.

Larry
