
Re: Transistor count



Larry and all,

Since my last posting I have (again) discovered David’s home page [5], and (probably for the first time) discovered some of his original papers discussing the XC and XCore architecture. I made a summary in my own blog note [6]. Sorry for all my URL refs, but if you can make some sense of my words without clicking on them, by all means, that would be nice.

As you may see from the blog note [6] (the error/warning/info messages concerning concurrency chapters), the XC language is «C plus unknown X», which is both a pun and true. X is about concurrency and parallelism (in its own way). The X is a jewel in many ways.

But as you may notice, the XC term has been downplayed by XMOS at some point. And now it’s even worse, with the xcore.ai coming with FreeRTOS. But the xTIMEcomposer compiles C and C++. The word XC has been ripped out of their vocabulary. I guess it’s banned by the marketing people. I can feel some sympathy with that standpoint, since programmers are the most conservative people in the world. But also some of the most modern in the world. And they know what happened to occam. But in my head, they should lift XC and C and C++ and FreeRTOS up, and let them all shine.

XMOS’s practice is also somewhat strange, because the Tiobe index picks up some of the XC code, and it’s not non-existent at all [8]. I guess it would also pick up my published code, because I have placed an «XC programming» in my XC notes (but probably not the zip files, because I have added an extra degree of difficulty in downloading them: a _ in the URL that the downloaders must remove themselves [9]). I have exchanged mail with Paul Jansen at Tiobe (in Holland) about this, and it was he who pointed out that XC was indeed visible.

But I assume you are interested in this as a mental exercise. The present XCORE-200 ExplorerKit board's processor has 256 KB SRAM per tile, 512 KB in all. The newer xcore.ai will have more. I have a picture of that board in [7]. It is only released to beta testers at the moment.

But that was no mental exercise. I get trapped in HW all the time.

The answer is: come with the C code for what you ask about and we could try, and wrap some of it in XC.

Øyvind

[5] http://people.cs.bris.ac.uk/~dave/
[6] https://www.teigfam.net/oyvind/home/technology/141-xc-is-c-plus-x/#background_info 
[7] https://www.teigfam.net/oyvind/home/technology/151-my-single-board-boards-and-why-notes/#xcoreai_explorer_board 
[8] https://www.teigfam.net/oyvind/home/technology/016-cooperative-scheduling-in-ansi-c-and-process-body-software-quality-metrics/#Ref10 
[9] https://www.teigfam.net/oyvind/home/technology/208-my-processor-to-analogue-audio-equaliser-notes/#startkit_code 

On 28 Nov 2020, at 01:02, Larry Dickson <tjoccam@xxxxxxxxxxx> wrote:

Hi Øyvind and all,

Going back to pick up a point in your Nov 24 2:36 AM (PST) (or Nov 23 5:36 PM European time?) post:

Suppose we want to go to the XCORE to emulate (?) a device that has lots of Transputer-like cores, connected (for starters) like Transputers in some topology. We want to put, say, 100 Transputer lookalikes on one core (or one logical thread?), and we accept the efficiency drop that multitasking imposes, not to mention the 40 or so link lookalikes that come out of the edge of our array of 100. It's the usual 10 x 10 square array (but maybe we can do stranger stuff like triangular array of triangles.) Nothing sacred about the number 100, but we certainly want more than 8. Anyway, each Transputer lookalike has effectively one sequential process running on it, with an ALT or select() for its link lookalikes, some of which may be edge, others interior.
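The figure of "40 or so" edge links follows from the geometry: if every node in a 10 x 10 array has four link ends (north, east, south, west), each link end that has no neighbour inside the array comes out of the edge. A small Python sketch that counts them by brute force (the function name is made up for this note; it is not part of any occam or XC toolchain):

```python
def count_links(rows, cols):
    """Count interior links and edge link ends in a grid of 4-link nodes.

    Each node has N, E, S, W link ends. A link end whose neighbour would
    fall outside the grid is an edge link end; otherwise it is one end of
    an interior link shared with that neighbour.
    """
    edge_ends = 0
    interior_ends = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    interior_ends += 1
                else:
                    edge_ends += 1
    # Every interior link was counted once from each of its two ends.
    return interior_ends // 2, edge_ends


interior, edge = count_links(10, 10)
print(interior, edge)  # 180 40
```

The same function works for non-square arrays, so other proportions can be tried on the back of the same envelope.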

Can this be done with XC using the standard, combinable, and distributable tasks? In standard occam on a standard Transputer, with hardware-software equivalence, you could use occonf to pretend there are hundreds of Transputers and map them 100 to a real Transputer, with some fancy extra processes to multiplex the edges. (If you are doing 4 links per virtual Transputer, 81 works better than 100, with 12 multiplexers ;-) Put code into my little guys doing some simple problem, say diffusion around a torus. Is this easy with XC, going for exact equivalence to the obvious occam program, even if the code looks different?
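To make "diffusion around a torus" concrete, here is a minimal sequential model in Python. It is neither occam nor XC: each grid cell stands in for one little sequential process, and the exchange over its four links is modelled by reading the previous grid. The update rule, the `rate` parameter and the 9 x 9 size are assumptions chosen for illustration only.

```python
from fractions import Fraction


def diffusion_step(grid, rate=Fraction(1, 8)):
    """One synchronous diffusion step on a torus.

    In each step a cell "sends" its current value over its four links and
    "receives" the four neighbour values. Indices wrap around in both
    directions, which is what turns the square array into a torus.
    """
    rows, cols = len(grid), len(grid[0])
    new = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            here = grid[r][c]
            neighbours = (
                grid[(r - 1) % rows][c]
                + grid[(r + 1) % rows][c]
                + grid[r][(c - 1) % cols]
                + grid[r][(c + 1) % cols]
            )
            # Move a fraction `rate` of each difference along each link.
            new[r][c] = here + rate * (neighbours - 4 * here)
    return new


# 9 x 9 torus with all the "heat" in one cell.
grid = [[Fraction(0)] * 9 for _ in range(9)]
grid[4][4] = Fraction(81)
for _ in range(50):
    grid = diffusion_step(grid)

total = sum(sum(row) for row in grid)
print(total)  # 81: what leaves one cell arrives at its neighbours
```

Exact fractions are used so that conservation of the total can be checked without rounding noise. A real per-node process would hold only its own value and do the four exchanges inside an ALT or select.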

Larry

On Nov 24, 2020, at 2:36 AM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:

Hi all

All we need is a kernel to support the comms and interpret the occam byte code

that’s it. There hasn’t been too much talk about the programming model here. (Some, I know). After all it’s a HW thread. 

The mentioned occam byte code comes from occam, probably meaning kind of new occam code. A new occam, newer even than the occam 3 suggestion.

As some of you may know I have coded a lot in XC over the past years. Even if there are issues [1], I’ve come to like it a lot. But it’s for XCORE (from XMOS) and I assume it may be hard to get it going for other architectures, if one wants to keep the timing guarantees and deterministic behaviour. But Turing said it could also run on a typewriter..

I certainly like their three task types: standard, combinable and distributable [2]. Even if I think it was an add-on to please people who loved tasks so much that they wanted more than one per logical core. Having programmers of threaded code be aware of which type of task they have in front of them won’t hurt. Occam supported standard, meaning that the processes there could have as much I/O, timer and internal communication as they needed. But it came at a cost. It’s that cost which is lowered with combinable (several tasks’ selects merged by the compiler) or distributable (communication is a simple function call using the same stack).
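As a rough mental model only, the difference between combinable and distributable can be sketched in Python. This is not XC and it does not reproduce what the XMOS compiler generates; all class and function names are invented for the illustration.

```python
from collections import deque


class Blinker:
    """A combinable-style task: its whole body is one select over events."""

    def __init__(self):
        self.toggles = 0

    def cases(self):
        # Event name -> handler, standing in for the cases of a select.
        return {"timer": self.on_timer}

    def on_timer(self, _payload):
        self.toggles += 1


class Receiver:
    """Another combinable-style task, reacting to a different event."""

    def __init__(self):
        self.seen = []

    def cases(self):
        return {"data": self.on_data}

    def on_data(self, payload):
        self.seen.append(payload)


def run_combined(tasks, events):
    """Merge the select cases of several tasks and serve them from one loop,
    so that one thread of control replaces one thread per task."""
    merged = {}
    for task in tasks:
        merged.update(task.cases())
    queue = deque(events)
    while queue:
        name, payload = queue.popleft()
        merged[name](payload)


class Store:
    """A distributable-style task: it only reacts to calls from clients,
    so each call can run as a plain function on the caller's own stack."""

    def __init__(self):
        self.value = 0

    def put(self, value):
        self.value = value

    def get(self):
        return self.value


blinker, receiver, store = Blinker(), Receiver(), Store()
run_combined([blinker, receiver],
             [("timer", None), ("data", 7), ("timer", None)])
store.put(receiver.seen[0])
print(blinker.toggles, receiver.seen, store.get())  # 2 [7] 7
```

A standard task, in this picture, would be the case where each task keeps its own loop and its own thread of control.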

I’m now awaiting my first xcore.ai board (not open for order yet). XMOS is now going «up» one step and ships FreeRTOS with their new system [3]. (No, they don’t pay me. I don’t even have an XMOS cup!)

I think they are targeting embedded AI, since they also include two tiles of 8 logical cores each and «each tile is a self-contained processor with 512 kByte single cycle SRAM. The tile has a scalar unit (up to 1600MIPS), a vector unit (up to 25,600 MMACS), and a floating point unit (up to 800 MFLOPS); 1 Mbyte tightly coupled SRAM, 3200 MIPS, 1600 MFLOPS, and 51,200 MMACCS across the tiles. The device has three integrated PHYs: a high-speed USB, a MIPI D-PHY receiver, and LPDDR1.»

Parallel coding and parallel HW architectures.. I like what I see. May they succeed beyond Amazon Alexa cases, which that architecture is too good to be limited to.

[3] https://www.xmos.ai/download/xcore.ai-Product-brief(3).pdf 

On 24 Nov 2020, at 10:49, Tony Gore <tony@xxxxxxxxxxxx> wrote:

Hi Larry
 
Your modern transputer sort of exists. Take a Raspberry-Pi 4 – it has 1 gigabit ethernet connection, 2 USB 3 and 2 USB 2 so using USB to ethernet dongles, you can effectively get up to 5 comms links, although their speed won’t be perfectly balanced, and you can use the wifi as the “control” port. People have built clusters of these for their own personal supercomputers.
 
All we need is a kernel to support the comms and interpret the occam byte code – as it has a quad core processor, you could use one core to handle the comms, one the code interpreter and one the kernel.
 
Tony
 
From: Larry Dickson <tjoccam@xxxxxxxxxxx> 
Sent: 24 November 2020 00:20
To: Ruth Ivimey-Cook <ruth@xxxxxxxxxx>
Cc: occam-com@xxxxxxxxxx; Tony Gore <tony@xxxxxxxxxxxx>; Uwe Mielke <uwe.mielke@xxxxxxxxxxx>; Denis A Nicole <dan@xxxxxxxxxxxxxxx>; Øyvind Teig <oyvind.teig@xxxxxxxxxxx>; David May <David.May@xxxxxxxxxxxxx>; Michael Bruestle <michael_bruestle@xxxxxxxxx>; Transputer TRAM <claus.meder@xxxxxxxxxxxxxx>
Subject: Re: Transistor count
 
The good ideas are coming hard and fast . . . Thank you, Ruth! Notes below.
 
On Nov 23, 2020, at 3:35 PM, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:


I've been following this discussion with great interest.

 

On 23/11/2020 21:43, Roger Shepherd wrote:

I think we can get smart here by giving some ground on "locally". Remember, in CSP any number of processes can share READ-ONLY memory, so you can have a sequence of "loading state" and "running state" (like the Transputer worm), and during running state a big block of read-only memory with the code is shared by, say, 100 nodes (each running the same, or almost the same, program). This requires a bit of design attention, because computer science says "any number read in parallel" but in the real world some sequences are involved.

Don’t believe this would work for independent processors. If they operate in a SIMD-like manner then maybe, but you carry the problem of every processor displaying worst-case behaviour. You can’t share memory; that’s why processors are typically tightly coupled to their I-caches.

I agree that memory needs to be very local, or the processor will be starved of work. The harder problem is how much memory needs to be local. Modern highly parallel processors are showing some interesting trends. Have a look, for example, at the latest NVidia GPU architecture -- it is definitely worth study. A bank of 256KB memory is supplied which can be partitioned between two purposes - either as local cache or as local storage. The local storage option then partakes of a global memory address map, where access to non-local addresses uses a comms architecture (not a shared bus). The amount of cache vs storage is configurable at run-time.

Another aspect of NVidia's design and more recent AMD designs is of course the concept of core clusters, in which a group of 4-16 cores share some resources. In an ideal world this would not be necessary but physics tends to demand this sort of outcome, and it is probably worth investigating for more general purpose solutions.

Studying GPU (and AI) architecture is certainly a great idea. Some people clearly know a lot more about it than I do ;-) But one thing I believe always needs to be done in parallel - track certain use cases, and keep lots of envelopes around to scribble on the back of, because the use cases are going to have different proportions that get more and more different as parallelism increases.

 

Use cases would be at the center because there would be a manufacturing process that cheaply varies parameters to create a chip optimized for any given use case.

You’ll also need more communication capability to deal with the number of processors. It’s absolutely the case that the transputer processor is underpowered by today’s standards; I don’t know by how much.

I wonder if this is true - if you analyze it in time units of clock cycles per single core. I don't think it is true, if you analyze it in clock cycles per million transistors.

What are you trying to do? If you are trying to get 2 processors to solve a problem faster than 1 at a reasonable cost, then you need to be using your resources quite well. One limited resource is RAM, and you probably need to work it hard. 32 cycles for a 32-bit multiply doesn’t cut it against a 1-multiply-per-cycle processor. The thing is, the cost of certain useful functions (multiplication) is pretty cheap compared with the cost of a processor; not being reasonably competitive on performance means you need too many processors to solve the problem. Now exactly how much processing you need to get a balanced system I do not know, but faster processors mean fewer of them, which means less communication infrastructure and less total RAM.
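The multiply argument can be put on the back of an envelope. The Python sketch below assumes equal clock rates and that every non-multiply operation costs one cycle on both processors; both assumptions are simplifications made for this note, and the workload fractions are placeholders rather than measurements.

```python
def relative_slowdown(multiply_fraction, slow_cycles=32, fast_cycles=1):
    """Average cycles per operation on the slow core divided by the same
    figure for the fast core, for a given share of multiplies in the mix."""
    if not 0.0 <= multiply_fraction <= 1.0:
        raise ValueError("multiply_fraction must be between 0 and 1")
    slow = (1 - multiply_fraction) + multiply_fraction * slow_cycles
    fast = (1 - multiply_fraction) + multiply_fraction * fast_cycles
    return slow / fast


# Placeholder workload mixes: none, some, half, all multiplies.
for fraction in (0.0, 0.1, 0.5, 1.0):
    print(fraction, relative_slowdown(fraction))
```

At a fraction of 1.0 the ratio is the full 32, which is the number of slow processors needed to match one fast one even with perfect parallel speed-up; for mixes with few multiplies the ratio is much smaller, so the answer depends on the use case.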

One thing I have been contemplating for some time is what it would take to make a modern transputer. I feel the critical element of a design is provision of channel/link hardware that has minimal setup time and DMA driven access. I feel reducing latency, especially setup time, requires a coprocessor-like interface to the cpu, so that a single processor instruction can initiate comms. If hardware access were required over PCIe, for example, it would take hundreds of processor cycles. Pushing that work into a coprocessor enables the main cpu to get on with other things, and maximises the chance that the comms will happen as fast as possible.

The other side of the coin would be if the link engine(s) were essentially all wormhole routers, as in the C104 router chips, complete with packet addressing. Thus the link coprocessor would essentially become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while taking up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result is the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
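For readers who have not met it: the C104 chose an output link by interval labelling, where each link owns a contiguous range of header values. A minimal Python sketch of that lookup follows. The table, the link names and the ranges are made up (they only echo the "CPU ports plus external links" split described above), and the wormhole aspect, forwarding the packet body behind the header before the whole packet has arrived, is not modelled.

```python
import bisect


class IntervalRouter:
    """Choose an output link from a packet header by interval labelling.

    `boundaries` is a sorted list of interval lower bounds and `links` the
    output link for each interval: headers in [boundaries[i], boundaries[i+1])
    leave on links[i]. The last interval ends at `limit` (exclusive).
    """

    def __init__(self, boundaries, links, limit):
        if len(boundaries) != len(links) or boundaries != sorted(boundaries):
            raise ValueError("need one link per sorted interval lower bound")
        self.boundaries = boundaries
        self.links = links
        self.limit = limit

    def route(self, header):
        if not self.boundaries[0] <= header < self.limit:
            raise ValueError(f"header {header} is outside every interval")
        # Index of the last lower bound that is <= header.
        index = bisect.bisect_right(self.boundaries, header) - 1
        return self.links[index]


# Hypothetical table: headers 0-3 go to two CPU ports, the rest leave on
# one of four external links.
router = IntervalRouter(
    boundaries=[0, 2, 4, 8, 12, 16],
    links=["cpu0", "cpu1", "ext_n", "ext_e", "ext_s", "ext_w"],
    limit=20,
)
print([router.route(h) for h in (0, 3, 4, 11, 19)])
# ['cpu0', 'cpu1', 'ext_n', 'ext_e', 'ext_w']
```

The point of the scheme is that the routing decision is a range comparison on the header alone, which is why no 'receive and pass on' code has to run on the processor.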

An excellent point. But we need to remain use-case-sensitive. Some physics is so simple that perhaps the main effort and its communications would be so standard that little such direction mapping code would be needed - and then the overhead of the wormhole stuff could be a negative.
 
Different kinds of links/channels can branch out in even more directions that have never been explored, like hybrid between soft and hard (NUMA). Anything that acts like a channel may be our friend.

The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing communication, not about which physical layer was involved, and also a way for the link engine to raise an exception (e.g. sw interrupt) to the processor if it cannot complete the communication 'instantly', thus potentially requiring a thread to suspend or be released.
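A minimal sketch of what software virtualisation of links means for the code on top: several virtual channels share one physical link by tagging each message with a channel id. This Python model is an illustration under simplifying assumptions; all names are invented, and it leaves out the synchronisation and acknowledge traffic of a real channel as well as the exception path for communication that cannot complete at once.

```python
from collections import deque


class PhysicalLink:
    """A stand-in for one hardware link: an ordered stream of packets."""

    def __init__(self):
        self._wire = deque()

    def send(self, packet):
        self._wire.append(packet)

    def receive(self):
        return self._wire.popleft()

    def pending(self):
        return len(self._wire)


class VirtualChannelEnd:
    """What application code sees: a channel, with no mention of the wire."""

    def __init__(self, channel_id, link):
        self.channel_id = channel_id
        self._link = link
        self.inbox = deque()

    def send(self, message):
        # The multiplexer's job: tag the message with its virtual channel.
        self._link.send((self.channel_id, message))


def demultiplex(link, ends_by_id):
    """Deliver every pending packet to the inbox of the matching channel."""
    while link.pending():
        channel_id, message = link.receive()
        ends_by_id[channel_id].inbox.append(message)


link = PhysicalLink()
senders = [VirtualChannelEnd(i, link) for i in range(3)]
receivers = {i: VirtualChannelEnd(i, link) for i in range(3)}

senders[2].send("c")
senders[0].send("a")
senders[2].send("d")
demultiplex(link, receivers)
print([list(receivers[i].inbox) for i in range(3)])  # [['a'], [], ['c', 'd']]
```

Application code only ever touches a `VirtualChannelEnd`, so the same code runs whether the channel is carried by a hardware link, shared memory, or something else.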

I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did, but do feel it is valuable for the hardware to provide appropriate hooks for a light threaded kernel to do the job efficiently.

I am not following you here. Where did this weigh heavily? It never seemed much of a burden to me - much less than the burden of supporting a kernel. Basically it's interrupts (done way more cleanly than other processors), a few words of process support, and rare, simply implemented time-slicing. You cannot escape interrupts, and any kernel I ever heard of is far more onerous than this (and has horrible effects on the code design, by separating kernel from user code). What burdened the Transputer was the standard correctness checks, but if you want correct code . . . And even those could be streamlined.

So, to be clear, my ideal 'modern transputer' would be:

 - something similar to an ARM M4 CPU core at something like 500MHz with its own L1 I and D cache, e.g. 8KB each;

 - at least 256KB SRAM, partitionable into (shared I/D-cache) or local storage (and clusterable);

 - a comms coprocessor capable of single CPU cycle comms setup, and autonomous link operation for non-local packets, and fitted with at least 4 external links and at least two direct to CPU links.

    I would want to research further whether a fixed grid connection of these was adequate, or whether a more advanced option (that enabled some communications to travel further in one hop) was better. Similarly, selecting the most effective number of CPU links would need investigation (this number defines the max true parallelism of link comms to/from the processor).

Also:

 - a core cluster approach to memory that enables sharing of 4 CPUs' storage memory (i.e. up to ~960KB directly addressable local RAM);

 - an external memory interface permitting all core clusters to access an amount of off-chip memory at some (presumably slow) speed, all clusters seeing the memory at the same addresses;

 

I would hope to get 64 cores on a chip as 16 clusters, though that's probably impossible for current FPGA density because of the RAM...?

Ruth's proposals seem to be focused on a different set of use cases than mine, so there is room in the universe for both of us ;-) GPUs show there is room on my side, and I have a notion that study of use cases will show there is lots of room out in embedded-style hundred-thousand-core-land.
 
Larry


 

Regards

Ruth

 

-- 
Software Manager & Engineer
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/




Øyvind TEIG 
+47 959 615 06
oyvind.teig@xxxxxxxxxxx
https://www.teigfam.net/oyvind/home
(iMac)