
Re: XCore matters



Hi David and Øyvind,

Let me see if I have it right - supposing I were doing a transmission from one thread to another (on the same core) in the XMOS architecture, then the transmitting thread is dedicated to the transmission for that time and the receiving thread to the reception (and, if I have understood them right, they are both running simultaneously but with staggered cycles). In the Transputer, the transmitting and receiving threads would both be descheduled and a memcpy-like operation would be happening, which would fully occupy the uniprocessor, so no third thread would be making any progress then.
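For concreteness, a tiny xC sketch of the two sides (illustrative; the names sender and receiver are not from the thread), run as par { sender(c); receiver(c); } on one tile:

  // Channel rendezvous: each call blocks until the other side is ready.
  // On the XCore the two threads run in staggered pipeline slots; on the
  // transputer both processes would instead be descheduled while the
  // memcpy-like transfer runs.
  void sender(chanend c)   { c <: 42; }        // output one word
  void receiver(chanend c) { int v; c :> v; }  // input one word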

If transmitting from one core to another, some physical DMA-like thing has to happen on the XMOS, doesn't it? Or do different cores share memory? In the Transputer's case, the two processes (this time on different Transputers) each seem to themselves to be doing the same thing - being descheduled until the transmission is done - but because the link transmission is comparatively slow, DMA has the advantage of allowing a second process on each Transputer (only one process is transmitting or receiving on a core in this case) to progress between the "stolen" DMA cycles. The whole advantage of DMA is due to the fact that link transmissions are much slower than memcpy.

Larry

On Dec 10, 2020, at 5:20 AM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

Hi David,

This was so understandable! I never thought about it that way! I always thought of DMA as something done in hardware.

Is it the fact that there’s a single-cycle output that makes it a software DMA? Why wouldn’t one also call the transputer architecture one? (Because it didn’t have direct port instructions like the XCore?) Or was it the fact that one could run several of these in parallel on a barrel machine, so that the DMA action would happen in between other threads’ cycles? It wouldn’t on the transputer..?

I updated the Barrel processor article:


It’s hard to find something to refer to; it’s always hard to make something truly lexical. But often someone will find a reference and add it. I may update the DMA article as well, since there is nothing there about the history of DMA. But I am reluctant to, since I am no specialist.

I didn’t even know about the concept of a barrel machine! 

Do you know if they had the concept of [[combinable]] and [[distributable]] in the CDC machines? If not, where did those concepts come from? I really like them.
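An aside for readers who don’t know xC (this sketch is illustrative, not from the thread): [[combinable]] marks a task whose body ends in a never-ending select loop, which is what lets the compiler combine several such tasks onto one logical core, while [[distributable]] marks a task whose transactions can run on the cores of its clients. A minimal combinable task, assuming a one-bit output port:

  [[combinable]]
  void blinker(out port p, unsigned period) {
    timer tmr;
    unsigned t, led = 0;
    tmr :> t;
    while (1) {               // combinable: body ends in a never-ending select loop
      select {
        case tmr when timerafter(t) :> void:
          led = !led;         // toggle the port every period ticks
          p <: led;
          t += period;
          break;
      }
    }
  }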

Øyvind

On 10 Dec 2020, at 10:47, David May <David.May@xxxxxxxxxxxxx> wrote:

Hi Øyvind,

The XCore processor has input and output instructions that transfer data to/from the input-output ports and the inter-process channels in a single cycle. If you write the obvious program to transfer a block of data - such as this one in the xc manual (p. 33):

for (i=0; i<10; i++) c <: snd[i];

it will result in a short loop continually loading words from memory and outputting them. So it performs the same function as a DMA controller - and it is only one thread, so there can be others running concurrently performing computations and/or inputs and outputs. Like a hardware DMA controller, the performance is deterministic and predictable. Unlike a DMA controller, a thread like this can support arbitrary access patterns (scatter/gather, traversing lists and trees …), on-the-fly encoding/decoding etc. - it’s just software!
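To make that concrete, here is a minimal self-contained sketch of the pattern (the names dma_out and consumer and the trivial buffer are illustrative, not from the manual): one thread does nothing but stream a buffer out over a channel while a second thread consumes the words concurrently:

  // Software "DMA" thread: streams n words from memory out over a channel.
  // On the XCore each output is a single-cycle instruction.
  void dma_out(chanend c, unsigned n, int snd[n]) {
    for (unsigned i = 0; i < n; i++)
      c <: snd[i];
  }

  // Consumer thread: processes the words on the fly as they arrive.
  void consumer(chanend c, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
      int x;
      c :> x;
      // ... arbitrary per-word processing could go here ...
    }
  }

  int main(void) {
    chan c;
    int snd[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    par {
      dma_out(c, 10, snd);   // the "DMA" thread
      consumer(c, 10);       // runs concurrently on another logical core
    }
    return 0;
  }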

The statement "The last two properties ensure that a load or store from memory always takes one or two instruction cycles on each core” in the document you found is misleading - the load or store always takes one cycle. The author may have been including the possibility that the thread is delayed because of the need for an instruction fetch, but this hardly ever happens - and is also deterministic and predictable.

This idea originated (I think) in the I/O system of the CDC 6000 series - the XCore is a form of barrel processor: https://en.wikipedia.org/wiki/Barrel_processor.

All the best

David

On 9 Dec 2020, at 17:53, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

All,

I am attempting to branch off here. I realise I may be the only person interested.

But I feel that some XCore matters convey points relevant to the rest of the theme(s) of this thread (else they might not have been put on the table).

And besides, we have the designer available:

David (cc all),

On 6 Dec 2020, at 19:17, David May <David.May@xxxxxxxxxxxxx> wrote:

...

It is entirely possible to fix the problems with communication set-up and completion - one option is the microprogrammed transputer-style implementation; another is the multithreaded XCore style with single-cycle i/o instructions (which means that threads can act as programmable DMA controllers);

Is it possible that you could explain this?

Is the "programmable DMA controller" something explicit in XC or is it implicit for (not really?) any communication between cores on tiles or between tiles or ports. Or is it the channels or interfaces, or use of (safe) pointers?

- - -

Even if I have now programmed in XC for years, I can still ask such s.. questions!

Here are some points I found by searching for «DMA» in the folder where I keep loads of downloads of XMOS/XCore-related documents:

In [1], chapter 1.6 (The underlying hardware model), it says (page 15/108) that:

* The memory on each tile has no cache.
* The memory on each tile has no data bus contention (all peripherals are implemented via the I/O sub-system which does not use the memory bus; there is no DMA for peripherals).

The last two properties ensure that a load or store from memory always takes one or two instruction cycles on each core. This makes worst case execution time analysis very accurate for code that accesses memory.

Tasks on different tiles do not share memory but can communicate via inter-core communication.

In xC, the underlying hardware platform being targeted provides a set of names that can be used to refer to tiles in the system. These are declared in the platform.h header file. The standard is to provide an array named tile. So the tiles of the system can be referred to as tile[0], tile[1], etc.
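A minimal sketch of a two-tile program using those names (producer, consumer and the word count are illustrative):

  #include <platform.h>  // declares the tile[] array for the target platform

  void producer(chanend c) { for (int i = 0; i < 10; i++) c <: i; }
  void consumer(chanend c) { int x; for (int i = 0; i < 10; i++) c :> x; }

  int main(void) {
    chan c;
    par {
      on tile[0]: producer(c);  // the tiles share no memory;
      on tile[1]: consumer(c);  // all data moves over the channel
    }
    return 0;
  }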

In [2] you write (‘Threads’, page 7; also mentioned in the XARCH2010 paper):

Threads are used for latency hiding or to implement ‘hardware’ functions such as DMA controllers and specialised interfaces

In [3] you write (‘Processes - use’, page 12; also in the NOCS paper):

Implement ‘hardware’ functions such as DMA controllers and specialised interfaces

In [4], «DMA» is only used in the context of communication with the (odd!) ARM core on the die, e.g. through library calls such as «xab_init_dma_write».

[1] https://www.xmos.ai/file/xmos-programming-guide (2015/9/18)
[2] http://people.cs.bris.ac.uk/~dave/hotslides.pdf
[3] http://people.cs.bris.ac.uk/~dave/iet2009.pdf
[4] https://www.xmos.ai/file/tools-user-guide


Øyvind
https://www.teigfam.net/oyvind/home

PS. For later, I also have some other XCore themes I will attempt to ask about..