RE: Occam-Tau - the natural successor to Occam-Pi

Just received info on the latest state of the art.

http://www.electronics-eetimes.com/en/french-startup-takes-on-fpgas-with-multicore-dsp-chip.html?cmp_id=7&news_id=222914154

http://www.kalray.eu/products/mppa-manycore/mppa-256/

Any comparison with other vendors making the same claim is left to the reader.

X-Chan: funny how you now embrace async after all the discussions we had on sync vs. async. Starting to look like our hubs. The key is indeed controlled async. And the right semantics for the services. (API).

From: Teig, Oyvind CCS [mailto:Oyvind.Teig@xxxxxxxxxxxxxxxx]
Sent: Wednesday, October 03, 2012 11:06 AM
To: Eric Verhulst (OLS)
Subject: RE: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

Eric

You certainly argue well! :-)

(But in a way I hope you don’t stop this thread either! But then, being from non-UK mainland is no advantage, either..)

I had a paper at CPA-2012: http://www.teigfam.net/oyvind/pub/pub_details.html#XCHAN . You probably would hate it, as you don’t believe in CHANs reachable from program code any more? But then, Peter seemed to love it, and even modeled XCHANs in occam-pi!

Øyvind

From: Mailing List Robot [mailto:sympa@xxxxxxxxxx] On Behalf Of Eric Verhulst (OLS)
Sent: 3 October 2012 10:25
To: 'Eric Verhulst (OLS)'; 'Larry Dickson'
Cc: 'Occam Family'
Subject: RE: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

It all depends on what you are talking about.

There is still a very large market for low power, low cost microcontrollers. Target price 1 $ or less. Most often these chips are pretty good for predictable real-time (e.g. the ARM Cortex M3 gives us 300 nanosecs interrupt latency (= reading the timer value after interrupt)). And the on-chip peripherals one gets included mean that the PCB design is trivial.

Now how realistic is this "large pool of everything" chip?

The chips that come closest are high-end FPGAs (1000's of $ apiece, with the processing blocks reduced to a sea of logic gates) or GPUs. Neither is easy to program or to integrate in a system, and both are mainly meant for specific applications. They are often at their best when working at high speed on regular data streams.

A general purpose solution with "a large pool of everything" would have to obey the fundamental law of parallel processing: a computation-to-communication ratio between 1 and 10. Given that some chips exist that have a few 1000 CPU cores (mainly used in basestations, again for processing regular datastreams), we should select the next big number, e.g. 1 million. So what would the chip look like to obey the fundamental ratio? 1 million CPU cores, assumed to be clocked at 1 GHz, an internal NoC that has a bandwidth in the order of 1 GHz*1 million*bytes/sec, each core with, of course, let's say 512 MBytes of on-chip SRAM (else we have wait states going off-chip to DDRAM) and 1 million I/O points. For simplicity let's assume that all I/O is a pair of LVDS, so this will require the same amount of ser-des circuits and about 2 million pins on this chip. Of course, the market will only buy this chip if it costs less than 10$ (in volume) and stays cool without a cooling tower and a fan the size of a windmill.
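
To put numbers on the paragraph above, here is a back-of-envelope sketch in Python. It assumes one byte moved per core per clock cycle (a computation-to-communication ratio of 1) and decimal units; the function and constant names are made up for illustration only.

```python
# Back-of-envelope budget for the hypothetical "1 million core" chip.
# Assumptions (not from any datasheet): 1 byte moved per core per cycle,
# decimal units (1 MByte = 10**6 bytes), one LVDS pair (2 pins) per I/O point.

CORES = 1_000_000
CLOCK_HZ = 1_000_000_000
SRAM_PER_CORE_BYTES = 512 * 10**6
PINS_PER_IO = 2


def chip_budget(cores=CORES, clock_hz=CLOCK_HZ, bytes_per_cycle=1):
    """Return the aggregate NoC bandwidth, SRAM size and pin count."""
    return {
        "noc_bytes_per_sec": cores * clock_hz * bytes_per_cycle,
        "sram_bytes": cores * SRAM_PER_CORE_BYTES,
        "io_pins": cores * PINS_PER_IO,
    }


budget = chip_budget()
print(f"NoC bandwidth: {budget['noc_bytes_per_sec'] / 1e15:.0f} PBytes/sec")
print(f"On-chip SRAM:  {budget['sram_bytes'] / 1e12:.0f} TBytes")
print(f"I/O pins:      {budget['io_pins']:,}")
```

Under these assumptions the chip needs about 1 PByte/sec of internal bandwidth, 512 TBytes of on-chip SRAM and 2 million pins, which is the point of the exercise.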

What's the state of the practice? Intel recently announced a chip with some tens of cores, but again not very useful for general purpose programming, certainly not embedded. We did a port to their experimental 48 core chip and bandwidth sucks all over the place.

The best examples I am aware of are the TI C6678 and Freescale P4080 8 core chips. Both are still conservative designs (a big highway in the middle) but they offer a lot in terms of processing power (160 GFlops), bandwidth (GBytes/sec), memory (512 KB L2 cache for each core that can be set up as SRAM) and tens of smart, high speed peripherals in a small package at a reasonable price. Power consumption is modest as well. That was the good news.

The bad news is that these chips are terribly complex to program. Silicon gates are supposed to be free, so the hardware has zillions of options. To understand the issue: the TI chips can route some 1000 interrupts to each core (using a 3 layer interrupt controller). Obviously, interrupt latency is still good, but relatively slower than on the much simpler ARM M3. The point is that a simple PRI ALT will not do the job. Two priorities are not sufficient. It worked more or less on the transputer because this chip had only one event pin. Impressive as the performance was at the time, it was still lacking for hard real-time when tens of processes were competing for the CPU. 32 to 256 priorities are needed to manage all the on-chip resources more or less satisfactorily.

So, it looks like the occam-restaurant is still at the end of the universe as Douglas Adams would have said. And looking at this discussion, the babelfish isn't helping very much.

Have fun,

Eric

From: Tony [mailto:Tony@xxxxxxxxxxxx]
Sent: Wednesday, October 03, 2012 12:11 AM
To: 'Larry Dickson'; Eric Verhulst (OLS)
Cc: 'Occam Family'
Subject: RE: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

Isn't a single processor now an outdated concept for a modern parallel programming paradigm?

If I understood David (May) correctly, we should assume that we have a large pool of everything, and only in some cases will physical resources provide a practical limitation.

Tony Gore

Sent from my mobile

tony@xxxxxxxxxxxx +44 7768 598570

-----Original Message-----
From: Eric Verhulst (OLS)
Sent: 02-10-2012, 20:11
To: 'Larry Dickson'
Cc: 'Occam Family'
Subject: RE: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

Am I understanding this right? The assumption is that all happens on a single processor?

Even if the system has no system-wide transparent semantics (and routing), it still means that the output on the channels (assuming occam) could have originated on another processor if there is more than one processor. That processor runs its own set of processes, independently of the scheduling on the processor where the ALT? is taking place. It might be idling or be locked up by a CPU-eating process.

So the ALT? has no control whatsoever over the order in which the channels it is waiting on will trigger. When the ALT? is reached, some other channels might be ready already. When this point is reached, the first one on the list is passed on.

This means in practice and semantically, that the PRI doesn't matter. Hence the notion of PRI is wrong.

In the ALT construct, the notion of PRI on the channel reading side makes no sense because one doesn't control when the channels trigger. It only makes sense at the moment of the channel output, in its turn added to the list of enabled channels of the ALT construct. The synchronisation happens at some point but independently of when the channels were triggered. Fairness however does not help to reach real-time goals. It follows semantics of "eventually" (= soft real-time).

To be side-effect free, either all these outputs must be equal (in priority) or the notion of priority must be system-wide and consistently applied. Which means, anything that can execute (= ready to run) does it in order of priority and the system must guarantee that. If not, all are equal.

The two priorities on the transputer were then also not sufficient for hard real-time. The Hi Pri queue was more like the ISR level on standard processors. So everything else is Lo Pri of equal priority. Notwithstanding, it was a lot easier to program than the way it has to be done on standard processors.

From: Larry Dickson [mailto:tjoccam@xxxxxxxxxxx]
Sent: Tuesday, October 02, 2012 7:59 PM
To: eric.verhulst@xxxxxxxxxxxxx
Cc: 'Occam Family'
Subject: Re: Occam-Tau - the natural successor to Occam-Pi - or is there one already?

On Oct 2, 2012, at 10:26 AM, "Eric Verhulst $ALTREONIC$" <eric.verhulst@xxxxxxxxxxxxx> wrote:

I guess you are right. But this also means that the notion of PRI(ority) was screwed up.

It really meant "the first one in the list". Assuming that the second one on the list had a higher priority (which could not be specified) but was inserted later, it wouldn't have mattered. Hence the conclusion remains the same: assume nothing about the order or priority.

With a priority sorted list, this dilemma disappears. The priority is a system-wide property (e.g. assigned to a task at design time following an RMA, rate monotonic analysis). When waiting on a priority sorted list, one gets the highest priority one that was inserted (before the waiting) even if it was inserted last.
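
A minimal sketch of such a priority sorted list, in Python. The class and method names are hypothetical, lower numbers are assumed to mean higher priority, and `wait` returns `None` on an empty list where a real kernel service would block the caller.

```python
import heapq
import itertools


class PrioritySortedList:
    """Hypothetical waiting list; a lower number means a higher priority."""

    def __init__(self):
        self._heap = []
        # Tie-breaker: keeps insertion (FIFO) order among equal priorities.
        self._seq = itertools.count()

    def insert(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def wait(self):
        """Return the highest-priority message inserted so far, or None if empty.

        A real implementation would block here instead of returning None.
        """
        if not self._heap:
            return None
        _priority, _seq, message = heapq.heappop(self._heap)
        return message


pending = PrioritySortedList()
pending.insert(200, "logging")
pending.insert(50, "sensor")
pending.insert(1, "alarm")  # inserted last, but highest priority
first = pending.wait()      # "alarm" is delivered first
```

The receiver never inspects the insertion order; the ordering is entirely a property of the priorities assigned to the senders.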

"One gets the highest priority one that was inserted" is always true but "before the waiting" is ambiguous. It's the highest priority one that was inserted before its own disabling. (If the ALT is in a low-priority process and the sender high priority, it could have even been inserted via an interrupt just before that disable instruction was executed, i.e. after the "waiting" was finished and some higher-ALT-priority disable instructions came up empty.)

"The first one on the list" (the source listing of the PRI ALT, or the assembly code listing of the disables) is the one that gets the highest priority. However, if by "list" you mean the readies that are inserted by the prospective senders, the one that has higher priority does get chosen even if it came ready later, as long as it came ready before the disable. This is a frequent occurrence when senders come ready to communicate before this ALT instance is executed --- for example, while the code branch resulting from a previous ALT selection in this same loop is still being executed, which may take a while. In such a case there is no wait. If the ALT suffers a wait, that is if no senders are in line when it got enabled, then the race is still possible but less frequent.

If there is a wait, and an ALT-lower-priority sender beats the ALT-higher-priority one by a few cycles, then the ALT disables will trigger fast enough to draw a blank on the higher-priority one and select the lower-priority one. It is immaterial whether the lower priority sender is PRI-high-priority or PRI-low-priority, though PRI-high-priority senders will tend to win these races because they can interrupt. In any such case, the ALT-high-priority sender will be picked up on the next loop iteration.
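
The two cases above can be mimicked with a small Python sketch. It is a deliberate simplification, not the transputer enable/disable instruction sequence: the hypothetical `pri_alt` function only models "scan the guards in source order at disable time and take the first one that is ready".

```python
def pri_alt(guards, ready):
    """Scan guards in source (disable) order; return the first ready one.

    guards: channel names in the order they are listed in the PRI ALT.
    ready:  set of channels whose sender is ready at the moment of the scan.
    Returns None if nothing is ready (a real ALT would wait instead).
    """
    for channel in guards:
        if channel in ready:
            return channel
    return None


guards = ["high", "low"]  # "high" is listed first, so it has ALT priority

# Case 1: both senders came ready while the previous branch was still
# executing. Arrival order is irrelevant; the first guard listed wins.
ready = {"low", "high"}
case1 = pri_alt(guards, ready)

# Case 2: the ALT was waiting and "low" comes ready a few cycles before "high".
ready = {"low"}
case2_first = pri_alt(guards, ready)   # disables run before "high" shows up
ready.discard(case2_first)
ready.add("high")                      # "high" arrives just after the scan
case2_second = pri_alt(guards, ready)  # picked up on the next loop iteration
```

In case 1 the selection is "high"; in case 2 the first selection is "low" and "high" follows on the next iteration, which is the race described above.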

Larry

Again, one should not assume anything about the order but at least now the highest priority one is received first. This mechanism reduces the blocking time for higher priority tasks in the system and generally improves the real-time behavior without affecting the logical behaviour, if correctly used.

Eric

From: Mailing List Robot [mailto:sympa@kent.ac.uk] On Behalf Of Larry Dickson
Sent: Tuesday, October 02, 2012 6:46 PM
To: Occam Family
Subject: Re: Occam-Tau - the natural successor to Occam-Pi - or is there one already?
Importance: Low

On Oct 2, 2012, at 9:18 AM, Eric Verhulst (ALTREONIC) <eric.verhulst@xxxxxxxxxxxxx> wrote:

PS.

The PRI ALT is just a language concept. As far as I know never implemented.

This appears to be exactly backward. The PRI ALT is the only thing ever implemented. The priority depends on the order of disabling. See the Inmos Compiler Writers Guide.

Larry

The consequence is that it is really an ALT. Which means, don't assume anything about the order in which the ALT triggers. If you do, you are likely to introduce side-effects in the code. Therefore the opposite is better. Always implement in order of priority but still assume that the "select" is priority independent. It keeps the behaviour consistent even if the timings (in a network) vary. Because in practice, one should not assume anything about the order of things on other nodes. Logical (functional) behaviour and timing behaviour should be independent.