Larry,

You are right about 4 being enough - in theory. In practice, we want to use tetrahedra in order to allow an unstructured mesh that can follow arbitrarily complex surfaces. It is then impossible to organise the data so that 4 links suffice, unless we can put one cell on each core/processor/node. The added complication is that we do not want to run one cell per core, because we are modelling with billions of cells. With reasonably sized groupings of cells, it is largely impossible to guarantee fully connected sets on individual cores (i.e. no orphaned enclaves - see the sketch below). This exemplifies the problem of parallel meshing of large complex geometries with unstructured meshes.

Regards, Chris
Prof. Christopher C R Jones BSc PhD CEng FIET
BAE Systems Engineering Fellow
EMP Fellow of the Summa Foundation
Principal Technologist – Electromagnetics
Military Air & Information
Direct: +44 (0) 3300 477425
Mobile: +44 (0) 7855 393833
Fax: +44 (0) 1772 855262
E-mail: chris.c.jones@xxxxxxxxxxxxxx
Web: www.baesystems.com
Electromagnetic Engineering, W423A, Engineering Integrated Solutions
BAE Systems (Operations) Limited, Warton Aerodrome, Preston PR4 1AX
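A minimal sketch of the enclave check Chris describes, in occam-2 (the thread's lingua franca), under assumed data structures: nbr[c][f] holds the four face-neighbours of tetrahedron c (-1 for a boundary face), mine[] marks the cells assigned to this core, and max.cells is an illustrative capacity. Flood-filling the owned cells across shared faces and counting connected components detects orphaned enclaves (any count above one):

  VAL INT max.cells IS 4096:    -- assumed capacity; assumes SIZE nbr <= max.cells

  -- Count connected components among the cells owned by this core.
  PROC count.components (VAL [][4]INT nbr, VAL []BOOL mine, INT components)
    [max.cells]INT stack:
    [max.cells]BOOL seen:
    INT top:
    SEQ
      SEQ i = 0 FOR max.cells
        seen[i] := FALSE
      components := 0
      SEQ c = 0 FOR SIZE nbr
        IF
          mine[c] AND (NOT seen[c])
            SEQ                        -- flood-fill one component from cell c
              components := components + 1
              seen[c] := TRUE
              stack[0] := c
              top := 1
              WHILE top > 0
                INT cur:
                SEQ
                  top := top - 1
                  cur := stack[top]
                  SEQ f = 0 FOR 4      -- visit the four face-neighbours
                    INT n:
                    SEQ
                      n := nbr[cur][f]
                      IF
                        (n >= 0) AND (mine[n] AND (NOT seen[n]))
                          SEQ
                            seen[n] := TRUE
                            stack[top] := n
                            top := top + 1
                        TRUE
                          SKIP
          TRUE
            SKIP
  :

In a real decomposition tool this check would run per candidate partition, and partitions with more than one component would be rejected or repaired.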
The beauty of occam (and the Transputer, which is occam-driven hardware design) is that all of computing, at all levels, is correctly modeled with a few simple primitives (those skinny little classics, the occam-2 Reference Manual and the Compiler Writer's Guide). None of the blinding distinctions between interrupts, kernel, operating system, user object code, and on and on . . . So you can just use it as pseudocode, and apply further formally provable equivalences if necessary to get efficient code (Øyvind pointed out some of this in his description of XCORE task types, which appear to be derivable from the occam "standard" task type by compiler transformations). But the beauty of that is that the equivalences are usually pretty simple to describe - like block device driver queues and IP (sketched below) - so you can start from your use case, model it in occam, then select optimizations that get you 90% of the way to the best possible, and use that to design your hardware and ISA. Then you cycle back with coding conventions in occam that enable the optimizing compiler to validly "grab" the patterns and map them to the efficient ISA. I don't know if anyone has ever done that last step, but it ought to be doable, maybe even on the XMOS chips?
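What "model it in occam" looks like for a block-device-style queue, as a minimal sketch (the depth, the INT payload, and the request/reply idiom are illustrative assumptions): a server process owns the buffer, and guarded alternatives protect against overflow and underflow.

  VAL INT depth IS 16:         -- assumed queue depth
  PROC fifo (CHAN OF INT put, req, get)
    [depth]INT q:
    INT head, tail, count:
    SEQ
      head, tail, count := 0, 0, 0
      WHILE TRUE
        ALT
          (count < depth) & put ? q[tail]   -- accept only while space remains
            SEQ
              tail := (tail + 1) \ depth
              count := count + 1
          INT any:
          (count > 0) & req ? any           -- serve only while data remains
            SEQ
              get ! q[head]
              head := (head + 1) \ depth
              count := count - 1
  :

A client obtains an element by signalling on req and then reading from get; occam forbids output guards in an ALT, which is why the explicit request channel is needed.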
Two points:

(1) One of the key features of occam, in my opinion, is hardware-software equivalence (p 71 of the Manual: "Configuration does not affect the logical behavior of a program", elaborated with the brilliant configuration language, which can be either hard or soft). Thus I disagree with Ruth, and with XMOS, and believe the soft multitasking should be preserved - it permits ISRs to be cleanly modeled as coherent processes, and it eliminates the need for a kernel or an OS. This is done by hand in assembly when coding chips like the Arduino and the MSP430, typically with one "low-priority" loop process and multiple interrupts (a sketch follows below). Matt Jadud, are you listening?
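A minimal sketch of that shape in occam - hw.event stands in for an interrupt source (on a real Transputer it would be a channel PLACEd at an event or link address), and all the names are illustrative:

  PROC handler (CHAN OF INT hw.event, to.main)
    WHILE TRUE
      INT code:
      SEQ
        hw.event ? code        -- blocks until the "interrupt" fires
        to.main ! code         -- hand off and go back to waiting
  :

  PROC main.loop (CHAN OF INT from.handler)
    WHILE TRUE
      INT code:
      SEQ
        from.handler ? code
        SKIP                   -- process the event at leisure here
  :

  CHAN OF INT hw.event, link:
  PRI PAR
    handler (hw.event, link)   -- high priority: preempts, like an ISR
    main.loop (link)           -- low priority: the background loop

The high-priority process preempts the low-priority loop exactly as an ISR would, but it is written, reasoned about, and composed like any other process.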
(2) I agree with Chris Jones:

====== I also think that 4 links is not enough. 6 should be a minimum for 3D simulations, and 10-20 would be better for more up-to-date unstructured problems. ======

I did some work in 1996 on classic Transputers, using Forrest Crowell and Neal Elzenga's published measurements, which indicated you could go to 8 links at full speed both ways and still be running CPU calculations at 48% efficiency in between the DMA cycle stealing (see my Nov 19 9:55 AM PST message on this thread). That does not get you to 20, but the hardware can be redesigned if a link-count increase is needed, e.g. by using 16- or 32-byte "words" in your DMA.
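(Back-of-envelope, with my algebra rather than Crowell and Elzenga's numbers: if each of the 16 active link directions steals a fraction f of memory cycles for DMA, CPU efficiency is roughly 1 - 16f, so 48% corresponds to f of about 3.25% per direction. If the stealing cost is per DMA word, wider 16- or 32-byte words shrink f roughly in proportion.)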
By the way, you can get away with 4 links in 3D if you have to (five tetrahedra make a cube: four corner tetrahedra plus one regular tetrahedron in the middle), but it is clumsy.

Larry

On Nov 24, 2020, at 2:36 AM, Øyvind Teig <oyvind.teig@xxxxxxxxxxx> wrote:
Hi all
That's it. There hasn't been too much talk about the programming model here (some, I know); after all, it's a HW thread. The occam byte code mentioned comes from occam, probably meaning some kind of new occam code - a new occam, newer even than the occam 3 suggestion. As some of you may know, I have coded a lot in XC over the past years. Even if there are issues [1], I've come to like it a lot. But it's for XCORE (from XMOS), and I assume it may be hard to get it going on other architectures, if one wants to keep the timing guarantees and deterministic behaviour. But Turing said it could also run on a typewriter…
I certainly like their three task types: standard, combinable and distributable [2], even if I think they were an add-on to please people who loved tasks so much that they wanted more than one per logical core. Having threaded-code programmers be aware of which type of task they have in front of them won't hurt. Occam supported standard, meaning that its processes could have as much I/O, timer usage and internal communication as they needed. But it came at a cost, and it's that cost which is lowered with combinable (several tasks' selects merged by the compiler) or distributable (communication becomes a simple function call using the same stack).
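In occam terms (my illustration, not XMOS's definition), combinable amounts to the transformation from separate server loops into one loop with a single ALT:

  -- Two separate server processes, each wanting its own core...
  PROC double.server (CHAN OF INT req, rep)
    WHILE TRUE
      INT x:
      SEQ
        req ? x
        rep ! x + x
  :

  PROC square.server (CHAN OF INT req, rep)
    WHILE TRUE
      INT x:
      SEQ
        req ? x
        rep ! x * x
  :

  -- ...merged by hand (or by an XC-style compiler) into one loop:
  PROC merged.server (CHAN OF INT req.d, rep.d, req.s, rep.s)
    WHILE TRUE
      INT x:
      ALT
        req.d ? x
          rep.d ! x + x
        req.s ? x
          rep.s ! x * x
  :

merged.server provides both services from one logical core, which is the cost reduction combinable buys; distributable goes further and turns the communication itself into a plain function call on the caller's stack.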
I'm now awaiting my first xcore.ai board (not open for order yet). XMOS is now going «up» one step and ships FreeRTOS with their new system [3]. (No, they don't pay me. I don't even have an XMOS cup!) I think they are targeting embedded AI, since they also include two tiles of 8 logical cores each, and «each tile is a self-contained processor with 512 kByte single cycle SRAM. The tile has a scalar unit (up to 1600 MIPS), a vector unit (up to 25,600 MMACS), and a floating point unit (up to 800 MFLOPS); 1 Mbyte tightly coupled SRAM, 3200 MIPS, 1600 MFLOPS, and 51,200 MMACCS across the tiles. The device has three integrated PHYs: a high-speed USB, a MIPI D-PHY receiver, and LPDDR1.» Parallel coding and parallel HW architectures… I like what I see. May they succeed beyond the Amazon Alexa cases; that architecture is too good to be limited to them.

[3] https://www.xmos.ai/download/xcore.ai-Product-brief(3).pdf
On 24 Nov 2020, at 10:49, Tony Gore <tony@xxxxxxxxxxxx> wrote:

Hi Larry

Your modern transputer sort of exists. Take a Raspberry Pi 4: it has one Gigabit Ethernet connection plus 2 USB 3 and 2 USB 2 ports, so using USB-to-Ethernet dongles you can effectively get up to 5 comms links, although their speeds won't be perfectly balanced, and you can use the Wi-Fi as the "control" port. People have built clusters of these as their own personal supercomputers. All we need is a kernel to support the comms and interpret the occam byte code - since it has a quad-core processor, you could use one core to handle the comms, one the code interpreter, and one the kernel.

Tony
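Tony's core assignment maps naturally onto occam's configuration language. A hedged sketch in occam-2 configuration style (toolsets differ on the details; comms.handler, interpreter and kernel are hypothetical processes, and placing the channels onto the real USB/Ethernet links is hand-waved):

  CHAN OF BYTE to.interp, from.interp, to.kernel, from.kernel:
  PLACED PAR
    PROCESSOR 0                -- one core drives the five links
      comms.handler (to.interp, from.interp)
    PROCESSOR 1                -- one core runs the occam byte-code interpreter
      interpreter (to.interp, from.interp, to.kernel, from.kernel)
    PROCESSOR 2                -- one core provides the kernel services
      kernel (to.kernel, from.kernel)

Per Larry's hardware-software equivalence point below, replacing the PLACED PAR with a plain PAR on a single core should leave the logical behaviour unchanged - only the performance differs.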
From: Larry Dickson <tjoccam@xxxxxxxxxxx>

The good ideas are coming hard and fast . . . Thank you, Ruth! Notes below.

On Nov 23, 2020, at 3:35 PM, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:
Studying GPU (and AI) architecture is certainly a great idea. Some people clearly know a lot more about it than I do ;-) But one thing I believe always needs to be done in
parallel - track certain use cases, and keep lots of envelopes around to scribble on the back of, because the use cases are going to have different proportions that get more and more different as parallelism increases.
An excellent point. But we need to remain use-case-sensitive. Some physics is so simple that the main effort and its communications might be so standard that little such direction-mapping code would be needed - and then the overhead of the wormhole machinery could be a net negative. Different kinds of links/channels can branch out in even more directions that have never been explored, like hybrids between soft and hard (NUMA). Anything that acts like a channel may be our friend.
I am not following you here. Where did this weigh heavily? It never seemed much of a burden to me - much less than the burden of supporting a kernel. Basically it's interrupts (done far more cleanly than on other processors), a few words of process support, and rare, simply implemented time-slicing. You cannot escape interrupts, and any kernel I ever heard of is far more onerous than this (and has horrible effects on code design, by separating kernel from user code). What burdened the Transputer was the standard correctness checks - but if you want correct code . . . And even those could be streamlined.
Ruth's proposals seem to be focused on a different set of use cases than mine, so there is room in the universe for both of us ;-) GPUs show there is room on my side, and
I have a notion that study of use cases will show there is lots of room out in embedded-style hundred-thousand-core-land. Larry