
Re: XCore matters



Hi David and Øyvind,

Let me see if I have it right - supposing I were doing a transmission from one thread to another (on the same core) in the XMOS architecture, then the transmitting thread is dedicated to the transmission for that time and the receiving thread to the reception (and, if I have understood them right, they are both running simultaneously but with staggered cycles). In the Transputer, the transmitting and receiving threads would both be descheduled and a memcpy-like operation would be happening, which would fully occupy the uniprocessor, so no third thread would be making any progress then.
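For concreteness, a tiny xC sketch of the two sides (illustrative; the names sender and receiver are not from the thread), run as par { sender(c); receiver(c); } on one tile:

  // Channel rendezvous: each call blocks until the other side is ready.
  // On the XCore the two threads run in staggered pipeline slots; on the
  // transputer both processes would instead be descheduled while the
  // memcpy-like transfer runs.
  void sender(chanend c)   { c <: 42; }        // output one word
  void receiver(chanend c) { int v; c :> v; }  // input one word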

If transmitting from one core to another, some physical DMA-like thing has to happen on the XMOS, doesn't it? Or do different cores share memory? In the Transputer's case, the two processes (this time on different Transputers) each seem to themselves to be doing the same thing - being descheduled until the transmission is done - but because the link transmission is comparatively slow, DMA has the advantage of allowing a second process on each Transputer (only one process is transmitting or receiving on a core in this case) to progress between the "stolen" DMA cycles. The whole advantage of DMA is due to the fact that link transmissions are much slower than memcpy.

Larry

On Dec 10, 2020, at 5:20 AM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

Hi David,

This was so understandable! I never thought about it that way! I always thought of DMA as something done in hardware.

Is it the fact that there’s a single-cycle output that makes it a software DMA? Why wouldn’t one also call the transputer architecture one? (Because it didn’t have direct port instructions like the XCore?) Or was it the fact that one could run several of these in parallel on a barrel machine, so that the DMA action would happen in between other threads’ cycles? It wouldn’t on the transputer..?

I updated the Barrel processor article:


It’s hard to find something to refer to; it’s always hard to make something truly lexical. But often someone will find a reference and add it. I may update the DMA article as well, since there is nothing there about the history of DMA. But I am reluctant to, since I am no specialist.

I didn’t even know about the concept of a barrel machine! 

Do you know if they had the concept of [[combinable]] and [[distributable]] in the CDC machines? If not, where did those concepts come from? I really like them.
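An aside for readers who don’t know xC (this sketch is illustrative, not from the thread): [[combinable]] marks a task whose body ends in a never-ending select loop, which is what lets the compiler combine several such tasks onto one logical core, while [[distributable]] marks a task whose transactions can run on the cores of its clients. A minimal combinable task, assuming a one-bit output port:

  [[combinable]]
  void blinker(out port p, unsigned period) {
    timer tmr;
    unsigned t, led = 0;
    tmr :> t;
    while (1) {               // combinable: body ends in a never-ending select loop
      select {
        case tmr when timerafter(t) :> void:
          led = !led;         // toggle the port every period ticks
          p <: led;
          t += period;
          break;
      }
    }
  }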

Øyvind

On 10 Dec 2020, at 10:47, David May <David.May@xxxxxxxxxxxxx> wrote:

Hi Øyvind,

The XCore processor has input and output instructions that transfer data to/from the input-output ports and the inter-process channels in a single cycle. If you write the obvious program to transfer a block of data - such as this one in the xc manual (p. 33):

for (i=0; i<10; i++) c <: snd[i];

it will result in a short loop continually loading words from memory and outputting them. So it performs the same function as a DMA controller - and it is only one thread, so there can be others running concurrently performing computations and/or inputs and outputs. Like a hardware DMA controller, the performance is deterministic and predictable. Unlike a DMA controller, a thread like this can support arbitrary access patterns (scatter/gather, traversing lists and trees …), on-the-fly encoding/decoding etc. - it’s just software!
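To make that concrete, here is a minimal self-contained sketch of the pattern (the names dma_out and consumer and the trivial buffer are illustrative, not from the manual): one thread does nothing but stream a buffer out over a channel while a second thread consumes the words concurrently:

  // Software "DMA" thread: streams n words from memory out over a channel.
  // On the XCore each output is a single-cycle instruction.
  void dma_out(chanend c, unsigned n, int snd[n]) {
    for (unsigned i = 0; i < n; i++)
      c <: snd[i];
  }

  // Consumer thread: processes the words on the fly as they arrive.
  void consumer(chanend c, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
      int x;
      c :> x;
      // ... arbitrary per-word processing could go here ...
    }
  }

  int main(void) {
    chan c;
    int snd[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    par {
      dma_out(c, 10, snd);   // the "DMA" thread
      consumer(c, 10);       // runs concurrently on another logical core
    }
    return 0;
  }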

The statement "The last two properties ensure that a load or store from memory always takes one or two instruction cycles on each core” in the document you found is misleading - the load or store always takes one cycle. The author may have been including the possibility that the thread is delayed because of the need for an instruction fetch, but this hardly ever happens - and is also deterministic and predictable.

This idea originated (I think) in the I/O system of the CDC 6000 series - the XCore is a form of barrel processor: https://en.wikipedia.org/wiki/Barrel_processor.

All the best

David

On 9 Dec 2020, at 17:53, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

All,

I am attempting to branch off here. I realise I may be the only person interested.

But I feel that some XCore matters convey points relevant to the rest of the theme(s) of this thread (else they might not have been put on the table).

And besides, we have the designer available:

David (cc all),

On 6 Dec 2020, at 19:17, David May <David.May@xxxxxxxxxxxxx> wrote:

...

It is entirely possible to fix the problems with communication set-up and completion - one option is the microprogrammed transputer-style implementation; another is the multithreaded XCore style with single-cycle i/o instructions (which means that threads can act as programmable DMA controllers);

Is it possible that you could explain this?

Is the "programmable DMA controller" something explicit in XC or is it implicit for (not really?) any communication between cores on tiles or between tiles or ports. Or is it the channels or interfaces, or use of (safe) pointers?

- - -

Even if I have now programmed in XC for years, I can still ask such s.. questions!

Here are some points I found by searching for «DMA» in the folder where I keep loads of downloads of XMOS/XCore-related documents:

In [1], chapter 1.6 (The underlying hardware model), it says (page 15/108) that:

* The memory on each tile has no cache.
* The memory on each tile has no data bus contention (all peripherals are implemented via the I/O sub-system which does not use the memory bus; there is no DMA for peripherals).

The last two properties ensure that a load or store from memory always takes one or two instruction cycles on each core. This makes worst case execution time analysis very accurate for code that accesses memory.

Tasks on different tiles do not share memory but can communicate via inter-core communication.

In xC, the underlying hardware platform being targeted provides a set of names that can be used to refer to tiles in the system. These are declared in the platform.h header file. The standard is to provide an array named tile. So the tiles of the system can be referred to as tile[0], tile[1], etc.
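A minimal sketch of a two-tile program using those names (producer, consumer and the word count are illustrative):

  #include <platform.h>  // declares the tile[] array for the target platform

  void producer(chanend c) { for (int i = 0; i < 10; i++) c <: i; }
  void consumer(chanend c) { int x; for (int i = 0; i < 10; i++) c :> x; }

  int main(void) {
    chan c;
    par {
      on tile[0]: producer(c);  // the tiles share no memory;
      on tile[1]: consumer(c);  // all data moves over the channel
    }
    return 0;
  }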

In [2] you write (‘Threads’, page 7; also mentioned in the XARCH2010 paper):

Threads are used for latency hiding or to implement ‘hardware’ functions such as DMA controllers and specialised interfaces

In [3] you write (‘Processes - use’, page 12; also in the NOCS paper):

Implement ‘hardware’ functions such as DMA controllers and specialised interfaces

In [4], «DMA» is only used in the context of communication with the (odd!) ARM core on the die, e.g. through library calls such as «xab_init_dma_write».

[1] https://www.xmos.ai/file/xmos-programming-guide (2015/9/18)
[2] http://people.cs.bris.ac.uk/~dave/hotslides.pdf
[3] http://people.cs.bris.ac.uk/~dave/iet2009.pdf
[4] https://www.xmos.ai/file/tools-user-guide


Øyvind
https://www.teigfam.net/oyvind/home

PS. For later, I also have some other XCore themes I will attempt to ask about..