
Re: XCore matters



David and Larry (++, I hope)

I had assumed a "tile" was a core, and that there were a number of cores on an XMOS chip. But, Øyvind, you said (about processes on different cores)

I am sorry, I have probably messed it up. But I have the picture below in my head all the time.

I have not yet looked into the xcore.ai architecture, since I haven't got any board yet.

The xCORE-200 architecture is described at https://www.xmos.ai/download/xCORE-200:-The-XMOS-XS2-Architecture-(ISA)(1.1).pdf (2015/04/01). What am I complaining about? It's all there for me to read, 289 pages. I could have known the answer to everything..

I found the above at https://www.xmos.ai/xcore-200/ - where there is much more, including the https://www.xmos.ai/download/xCORE-200-Product-Brief(1.0).pdf. Here's the figure:

[Figure from the xCORE-200 product brief: block diagram of the device]

The left and the right columns are TILES. That phrase is not in the brief, but it is in the code, like on tile[0]: 

Each TILE has 8 LOGICAL CORES and one each of SRAM and OTP. Etc. 
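For reference, the placement syntax I mean looks like this - a minimal sketch of mine (not from the brief), assuming a two-tile target; the task names are only illustrative:

#include <platform.h>          // declares the tile[] array for the target

void task_a(chanend c) { c <: 42; }        // runs on a logical core of tile[0]
void task_b(chanend c) { int v; c :> v; }  // runs on a logical core of tile[1]

int main(void) {
  chan c;
  par {
    on tile[0]: task_a(c);     // the tools route c between the tiles
    on tile[1]: task_b(c);
  }
  return 0;
}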

I never had the impression that occam protocol semicolons were anything fundamental - just the start of a new assembly-level input or output, i.e. they could be put in sequence as separate lines of code, except for the need to group things in a PROTOCOL. In my language investigations using lex and yacc, I found the semicolon could be eliminated entirely (and replaced with a comma). That makes it possible to define a variant on occam that looks like C.

I thought semicolons in the protocol had semantic value, in that they introduced synchronisation points. I first saw this in SPoC (the Southampton Portable occam Compiler), where the generated C code did a complete channel communication between each semicolon. Like two cogwheels. That said, I couldn't use it for anything, could I? So I thought it had to do with giving some other processes more time: giving the communication, for what it was worth, a smaller granularity. I feel like this is kind of storytelling, nice to read, perhaps, but precise..?
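If I try to say the same kind of thing in XC (not occam; the names are mine, just for illustration), each field of a "protocol" is its own synchronised communication, so the two ends mesh field by field:

void send_record(chanend c, int id, char tag, int value) {
  c <: id;       // first communication
  c <: tag;      // second communication
  c <: value;    // third communication
}

void recv_record(chanend c, int &id, char &tag, int &value) {
  c :> id;       // each input synchronises with the matching output
  c :> tag;
  c :> value;
}

int main(void) {
  chan c;
  int id;
  char tag;
  int value;
  par {
    send_record(c, 7, 'x', 42);
    recv_record(c, id, tag, value);
  }
  return 0;
}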

Øyvind


On 10 Dec 2020, at 23:45, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

Peter and Øyvind,

You are right, Peter, about the internal channel (I was confusing it with a link communication, where they do both deschedule). I knew that, and put it into my Fringes and Workshop. But my main point was about the DMA, and I still think I am right. The internal T4 communication can go at 40 MB/s (one word every 2 cycles) but the link is limited to less than 2 MB/s. Thus, dedicating a process to transmitting or receiving on a link could result in a 95% waste of cycles. With DMA, most of this is recaptured (my work on Crowell and Elzenga's results showed the burden bandwidth of a link transmission was 37 MB/s).

I had assumed a "tile" was a core, and that there were a number of cores on an XMOS chip. But, Øyvind, you said (about processes on different cores)

======
They share. But different slices don’t. 
======

And then you corrected yourself and said "slices" should be tiles, which I had assumed were cores. Now I am confused. By "core" I mean something that can run cycles without pause, timewise exactly on top of another core (not staggered). But what do they share, and what is distributed? Is there an XMOS assembly language?

What matters is which resources exist, and when they are in use. Transputer and occam make it pretty easy to follow that, but I have a hard time with all the distinctions being introduced by the XMOS.

I never had the impression that occam protocol semicolons were anything fundamental - just the start of a new assembly-level input or output, i.e. they could be put in sequence as separate lines of code, except for the need to group things in a PROTOCOL. In my language investigations using lex and yacc, I found the semicolon could be eliminated entirely (and replaced with a comma). That makes it possible to define a variant on occam that looks like C.

Larry

On Dec 10, 2020, at 12:31 PM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

It looks like I meant to say that occam protocol semicolons and some XC interface patterns are the same. They are not, by far. XC has roles (client, server), and tasks would run in between. There is also a data-less synchronisation, which I think may be compared with occam-pi's '!!'. But the similarity comes from the fact that there are chunks of communication, is what I meant to say.
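The data-less synchronisation I have in mind is the notification pattern. A minimal sketch from memory (interface and task names are mine, only illustrative): the server raises a notification that carries no data, and the client selects on it before fetching the value:

interface sensor_if {
  [[notification]] slave void data_ready(void);   // the data-less "signal"
  [[clears_notification]] int get_sample(void);
};

void sensor_server(server interface sensor_if i) {
  int sample = 42;
  i.data_ready();                  // notify the client; no data is carried
  while (1) {
    select {
      case i.get_sample() -> int result:
        result = sample;
        break;
    }
  }
}

void client_task(client interface sensor_if i) {
  int x;
  select {
    case i.data_ready():           // wait for the notification
      x = i.get_sample();          // then fetch the value synchronously
      break;
  }
}

int main(void) {
  interface sensor_if i;
  par {
    sensor_server(i);
    client_task(i);
  }
  return 0;
}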

Øyvind

On 10 Dec 2020, at 21:10, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

«slice» -> «tile» (I always get that wrong, don’t I)

On 10 Dec 2020, at 20:52, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

Larry,

On 10 Dec 2020, at 17:19, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:
Hi David and Øyvind,

Let me see if I have it right - supposing I were doing a transmission from one thread to another (on the same core) in the XMOS architecture, then the transmitting thread is dedicated to the transmission for that time and the receiving thread to the reception (and, if I have understood them right, they are both running simultaneously but with staggered cycles).

Disclaimer: I have programmed the XCore in XC, but not studied its internal structure or its instructions. I haven't really figured out how to get hold of the necessary information(?). So the below is what I have been able to pick up. Please correct!

I think that on the XCore it depends. All the logical cores on a tile have shared memory. If we talk about task-to-task synchronised communication, I think the compiler builds up as many communication patterns as it can. If a task can get away with just being an inline call on the caller's stack ([[distributable]]), the compiler builds code for 3 patterns; if a task may share a logical core, so that the tasks' select statements can be merged ([[combinable]]), it builds code for 2 patterns; and if a task is like an occam process, in that it may have state changes also outside the main select (a standard task), it builds code for that pattern only, and the task needs a logical core to itself. The mapper (xmap?) then decides how to do the communication, depending on the code and the configuration.

It may then just communicate by moving words around with no protection (because it knows it's not needed: single-cycle instructions, a single word, and what do I know), or use hw locks (a limited number, but they are only used intermittently, I think), or attach to a chanend and communicate synchronously. A chanend (in hw) is a handle for synchronisation and, provided one knows what one is doing, may be used in both directions. Now I need help from David. I think that there always is synchronisation, even for the asynchronous XC interface patterns, because they are built from small synchronous communications/synchronisations. Like the semicolon in occam. An event mechanism gets a task going again when it's supposed to. I showed on my slides in the Fringe in Dresden that the same code may end up using from zero to (was it?) 6 chanends (https://www.teigfam.net/oyvind/home/technology/175-cpa-2018-fringe/).
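To make the patterns a little more concrete, here is a minimal XC sketch of mine (interface and task names are only illustrative). The server is marked [[distributable]], so the tools may inline its calls into the client, merge its select with others on one logical core, or give it a core and chanend of its own:

interface counter_if {
  void inc(void);
  int get(void);
};

[[distributable]]
void counter_server(server interface counter_if i) {
  int count = 0;
  while (1) {
    select {
      case i.inc():                    // transaction: no data back
        count++;
        break;
      case i.get() -> int result:      // transaction: returns a value
        result = count;
        break;
    }
  }
}

void client_task(client interface counter_if i) {
  int n;
  i.inc();
  i.inc();
  n = i.get();   // n is now 2; each call is a synchronised exchange with the server
}

int main(void) {
  interface counter_if i;
  par {
    counter_server(i);
    client_task(i);
  }
  return 0;
}

A standard task would look similar but also do work outside the main select, and then it needs a logical core to itself.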

In the Transputer, the transmitting and receiving threads would both be descheduled and a memcpy-like operation would be happening, which would fully occupy the uniprocessor, so no third thread would be making any progress then.

If transmitting from one core to another, some physical DMA-like thing has to happen on the XMOS, doesn't it? Or do different cores share memory?

They share. But different slices don't. I am not certain whether the DMA matter has anything to do with the kind of task-to-task communication I tried to sum up above. I just think (help!!) that was with ports only. Plus, I guess, the internal router on the XCore, which I think only takes care of comms between slices.

Øyvind

In the Transputer's case, the two processes (this time on different Transputers) each seem to themselves to be doing the same thing - being descheduled until the transmission is done - but because the link transmission is comparatively slow, DMA has the advantage of allowing a second process on each Transputer (only one process is transmitting or receiving on a core in this case) to progress between the "stolen" DMA cycles. The whole advantage of DMA is due to the fact that link transmissions are much slower than memcpy.

Larry

On 10 Dec 2020, at 19:57, Roger Shepherd <rog@xxxxxxxx> wrote:

Larry

On 10 Dec 2020, at 17:19, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

Hi David and Øyvind,

Let me see if I have it right - supposing I were doing a transmission from one thread to another (on the same core) in the XMOS architecture, then the transmitting thread is dedicated to the transmission for that time and the receiving thread to the reception (and, if I have understood them right, they are both running simultaneously but with staggered cycles).

Regarding the transputer

In the Transputer, the transmitting and receiving threads would both be descheduled and a memcpy-like operation would be happening, which would fully occupy the uniprocessor, so no third thread would be making any progress then.

I think you are talking about an internal channel here. You are not correct. The second process to execute an input or output instruction on an internal channel does not deschedule - the (input or output) instruction copies the data and, when completed, schedules the other process. It is almost the case that no third process makes progress (a high-priority process can interrupt a copy being performed by a low-priority process).

Note that the transputer communicates in blocks of memory.

If transmitting from one core to another, some physical DMA-like thing has to happen on the XMOS, doesn't it? Or do different cores share memory?

I will let an Xcore expert respond to this, but the thing about the Xcore is that threads are really cheap, with several interleaved at the same time. An Xcore thread can also synchronise with hardware very quickly - for example, on the cycle after a register gets set, an Xcore thread can take the data from the register and, for example, put it into memory. This is what David means by using an Xcore thread as a DMA engine.

In the Transputer's case, the two processes (this time on different Transputers) each seem to themselves to be doing the same thing - being descheduled until the transmission is done - but because the link transmission is comparatively slow, DMA has the advantage of allowing a second process on each Transputer (only one process is transmitting or receiving on a core in this case) to progress between the "stolen" DMA cycles. The whole advantage of DMA is due to the fact that link transmissions are much slower than memcpy.

The advantage over what alternative? If the alternative were single byte or word communication, then the presence of block-based comms avoids what would be a large synchronisation overhead when communicating blocks of data. This is true internally and externally. If the alternative were to dedicate the processor to moving data from/to the link hardware, then there is a synchronisation issue - how does a communication occur? The problem is that to make an external communication work, two processes/CPUs have to be used, one on each end of the link; the internal communication can be performed by one process/CPU. Delegating communication to the link removes this synchronisation problem, which would occur whatever the data rate.

Essentially, on the Xcore you can dedicate a thread to performing the function of a half-link, in fact potentially a more intelligent link which could perform scatter/gather. To get this to work internally as well requires that synchronisation is separated from communication.
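As a sketch of what such a thread might look like (in XC; the names and fixed buffer sizes are mine, purely for illustration): one thread gathers a header and a payload from separate buffers onto a single channel end, and its peer scatters them back out at the far end:

#define HDR_WORDS 4
#define PAYLOAD_WORDS 16

// Gathering end of the "intelligent half-link"
void gather_out(chanend c, int hdr[HDR_WORDS], int payload[PAYLOAD_WORDS]) {
  for (int i = 0; i < HDR_WORDS; i++)
    c <: hdr[i];                 // one single-cycle output per word
  for (int i = 0; i < PAYLOAD_WORDS; i++)
    c <: payload[i];
}

// Scattering end
void scatter_in(chanend c, int hdr[HDR_WORDS], int payload[PAYLOAD_WORDS]) {
  for (int i = 0; i < HDR_WORDS; i++)
    c :> hdr[i];
  for (int i = 0; i < PAYLOAD_WORDS; i++)
    c :> payload[i];
}

int main(void) {
  chan c;
  int tx_hdr[HDR_WORDS] = {1, 2, 3, 4};
  int tx_data[PAYLOAD_WORDS] = {0};    // contents do not matter here, only the shape
  int rx_hdr[HDR_WORDS];
  int rx_data[PAYLOAD_WORDS];
  par {
    gather_out(c, tx_hdr, tx_data);
    scatter_in(c, rx_hdr, rx_data);
  }
  return 0;
}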

Roger


Larry

On Dec 10, 2020, at 5:20 AM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

Hi David,

this was so understandable! I never thought about it that way! I always thought of DMA as something done in HW.

Is it the fact that there's a single-cycle output that makes it a software DMA? Why wouldn't one also call the transputer architecture one? (Because it didn't have direct port instructions like the XCore has?) Or is it the fact that one can run several of these in parallel on a barrel machine, so that the DMA action happens in between other threads' cycles? It wouldn't on the transputer..?

I made an update in the Barrel processor article:


It is hard to find something to refer to; it is always hard to make something truly encyclopedic. But often someone will find a reference and add it. I may update the DMA article as well, since there is nothing there about the history of DMA. But I am reluctant to, since I am no specialist.

I didn’t even know about the concept of a barrel machine! 

Do you know if they had the concept of [[combinable]] and [[distributable]] in the CDC machines? If not, where did those concepts come from? I really like them.

Øyvind

On 10 Dec 2020, at 10:47, David May <David.May@xxxxxxxxxxxxx> wrote:

Hi Øyvind,

The XCore processor has input and output instructions that transfer data to/from the input-output ports and the inter-process channels in a single cycle. If you write the obvious program to transfer a block of data - such as this one in the xc manual (p. 33)  

for (i=0; i<10; i++) c <: snd[i]

it will result in a short loop continually loading words from memory and outputting them. So it performs the same function as a DMA controller - and it is only one thread, so there can be others running concurrently performing computations and/or inputs and outputs. Like a hardware DMA controller, the performance is deterministic and predictable. Unlike a DMA controller, a thread like this can support arbitrary access patterns (scatter/gather, traversing lists and trees …), on-the-fly encoding/decoding etc - it's just software!
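Wrapped up as a complete program it might look something like this (a minimal sketch - the surrounding names are mine, not from the manual), with a third thread free to keep computing while the transfer runs:

void send_block(chanend c, int snd[10]) {
  for (int i = 0; i < 10; i++)
    c <: snd[i];            // one single-cycle output per word
}

void receive_block(chanend c, int rcv[10]) {
  for (int i = 0; i < 10; i++)
    c :> rcv[i];
}

void compute(void) {
  // stands in for any other work; it runs concurrently while the two
  // transfer threads behave like a DMA controller
  for (int i = 0; i < 1000; i++)
    ;
}

int main(void) {
  chan c;
  int snd[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  int rcv[10];
  par {
    send_block(c, snd);
    receive_block(c, rcv);
    compute();
  }
  return 0;
}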

The statement "The last two properties ensure that a load or store from memory always takes one or two instruction cycles on each core” in the document you found is misleading - the load or store always takes one cycle. The author may have been including the possibility that the thread is delayed because of the need for an instruction fetch, but this hardly ever happens - and is also deterministic and predictable.

This idea originated (I think) in the I/O system of the CDC6000 series - the XCore is a form of Barrel processor https://en.wikipedia.org/wiki/Barrel_processor.

All the best

David

On 9 Dec 2020, at 17:53, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

All,

I am attempting to branch off here. I realise I may be the only person interested.

But then I feel that some XCore matters convey points relevant to the rest of the theme(s) of this thread (else they might not have been put on the table).

And besides, we have the designer available:

David (cc all),

On 6 Dec 2020, at 19:17, David May <David.May@xxxxxxxxxxxxx> wrote:

...

It is entirely possible to fix the problems with communication set-up and completion - one option is the microprogrammed transputer-style implementation; another is the multithreaded XCore style with single-cycle i/o instructions (which means that threads can act as programmable DMA controllers);

Is it possible that you could explain this?

Is the "programmable DMA controller" something explicit in XC, or is it implicit for (not really?) any communication between cores on a tile, between tiles, or with ports? Or is it the channels or interfaces, or the use of (safe) pointers?

- - -

Even if I have now programmed in XC for years, I can ask such s.. questions!

Here are some points I found by searching for «DMA» in the folder where I keep loads of downloaded XMOS/XCore-related documents:

In [1] chapter 1.6 The underlying hardware model, it says (page 15/108) that:

* The memory on each tile has no cache.
* The memory on each tile has no data bus contention (all peripherals are implemented via the I/O sub-system which does not use the memory bus; there is no DMA for peripherals).

The last two properties ensure that a load or store from memory always takes one or two instruction cycles on each core. This makes worst case execution time analysis very accurate for code that accesses memory.

Tasks on different tiles do not share memory but can communicate via inter-core communication.

In xC, the underlying hardware platform being targeted provides a set of names that can be used to refer to tiles in the system. These are declared in the platform.h header file. The standard is to provide an array named tile. So the tiles of the system can be referred to as tile[0], tile[1], etc.

In [2] you write that (‘Threads’ page 7, also mentioned in XARCH2010 paper)

Threads are used for latency hiding or to implement ‘hardware’ functions such as DMA controllers and specialised interfaces

In [3] you write that (‘Processes - use ’ page 12, also in NOCS paper)

Implement ‘hardware’ functions such as DMA controllers and specialised interfaces

In [4], «DMA» is only used in the context of communication with the (odd!) ARM core on the die, e.g. through library calls such as «xab_init_dma_write».

[1] https://www.xmos.ai/file/xmos-programming-guide (2015/9/18)
[2] http://people.cs.bris.ac.uk/~dave/hotslides.pdf
[3] http://people.cs.bris.ac.uk/~dave/iet2009.pdf
[4] https://www.xmos.ai/file/tools-user-guide


Øyvind
https://www.teigfam.net/oyvind/home

PS. For later I also have some other XCore themes that I will attempt to ask about..

--
Roger Shepherd

Øyvind TEIG 
+47 959 615 06
oyvind.teig@xxxxxxxxxxx
https://www.teigfam.net/oyvind/home
(iMac)