Hi all,
Info: here is an alternative computing machine (www.emutechnology.com) which uses a massively distributed memory architecture. The main difference vs. conventional high-performance computing is the approach of moving small computing thread(s) close to widely distributed data attached to simple CPUs, instead of moving all the data to one high-performance CPU. That requires massive, fine-grained task parallelism and a message-passing network for thread distribution and synchronization, but it may save a lot of energy.
The 1st (demo) generation Emu (Chick) was built on the Altera Arria-10 FPGA:
- One Node = 8 NodeLets
- One NodeLet = 1 “Gossamer” core running @ 300 MHz
- each NodeLet has one 8-bit (Narrow Channel) DDR-SDRAM controller connected to one 2GB DDR4-SDRAM
  (note: DDR4 uses a burst length of 8 (8n prefetch), i.e. one access delivers 8 bytes serially)
- per Node one Migration Engine crossbar w/ 6 communication ports (3D);
  each communication port is a 4-lane Serial Rapid-IO Gen2.0 @ 2.5 Gbps in each direction
- per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets
Scaling (Emu Chick):
- up to 8 Node Boards = 64 Gossamer cores
The 2nd generation Emu-1 (Rack) is based on a custom ASIC containing the “Gossamer” cores & migration engine:
- One Node = 8 NodeLets
- One NodeLet = 4 “Gossamer” cores running @ ??? MHz
- each NodeLet shares one 8-bit (Narrow Channel) DDR-SDRAM controller connected to one 8GB DDR4-2133-SDRAM
- per Node one Migration Engine crossbar w/ 6 ports (3D);
  each communication port is a 4-lane Serial Rapid-IO Gen2.3 @ 6.25 Gbps in each direction
- per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets
Scaling (limited by Rapid-IO ports):
- up to 64K ports => 64K Nodes (max. 2,097,152 Gossamer cores)
- Cube Topology with Face Diagonals, 120 GB/s bisection bandwidth
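(The maximum core count follows directly from the per-node figures above: 64K Nodes × 8 NodeLets/Node × 4 Gossamer cores/NodeLet = 2,097,152 cores.)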
Emu … key elements:
“Gossamer” Core (GC) – up to 8 per Node
- 64-way multithreaded (64bit?) CPU … to hide memory latency
- Accumulator + 16 general registers, small instruction cache, no data caches!
- one FMA FPU per core
- instruction set (see the small stand-in example after this list):
  o rich suite of Atomic Memory Operations
  o SPAWN instruction to create a new thread (on any NodeLet)
  o RELEASE instruction to place context in a Service Queue for processing by SC
- thread scheduling is automatic and performed by hardware
- thread migration to other NodeLets via HW queues (no software involvement)
- GC will perform local computations and memory references
- GC will call system services on Stationary Cores
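To make the “atomic memory operations instead of data caches” point concrete, here is a tiny stand-in sketch in plain C11 (this is NOT the Emu ISA or its intrinsics, which are not spelled out here): a histogram where every update is a single atomic add at the element's home location, so the work goes to the data rather than the data being pulled through a cache hierarchy.

  #include <stdatomic.h>
  #include <stdio.h>

  /* On Emu the bins would be spread across NodeLets; a thread touching a
     remote bin would migrate there and update it in place. C11 atomics
     are used here only as a portable stand-in for the hardware AMOs. */
  atomic_long bins[256];

  void count_byte(unsigned char b)
  {
      atomic_fetch_add_explicit(&bins[b], 1, memory_order_relaxed);
  }

  int main(void)
  {
      const unsigned char sample[] = "gossamer";
      for (int i = 0; sample[i] != '\0'; i++)
          count_byte(sample[i]);
      printf("bins['s'] = %ld\n", (long)atomic_load(&bins['s']));
      return 0;
  }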
Stationary Core (SC) – one per Node
- 64bit PowerPC e5500
- runs Linux (CentOS)
- SC manages the file system (SATA interface), 1TB SSD per node
- SC manages I/O (PCIe interface - Ethernet)
- SC will initialize and close Gossamer threads (code, data-in, result-out)
Migration Engine – one per Node
- crossbar w/ 6 communication ports
- each communication port is a 4-lane Serial Rapid-IO connection
- up to 64K ports = 64K Nodes (up to 8 NodeLets/Node)
Hardware
1) Node Board = 1 Node (8 NodeLets)
2) 32-Node Motherboard
- will provide Rapid-IO network interconnects for all nodes and to other motherboards
3) one 19” Rack 3U-Tray = one 32-Node Motherboard (1024 Gossamer cores)
- will provide further Rapid-IO network interconnects …
4) one Rack = 8 Trays (8192 Gossamer cores)
- will provide further Rapid-IO network interconnects …
Scaling up to 256 Racks (2,097,152 Gossamer cores), 4PB DDR4 RAM, 128PB SSD, ~10MW.
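(Working the numbers from the items above: 256 racks × 8 trays × 32 nodes = 65,536 = 64K Nodes; × 32 Gossamer cores/Node = 2,097,152 cores; × 64 GB DDR4/Node (8 NodeLets × 8 GB) = 4 PB DDR4 RAM.)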
Software
LLVM compiler for Cilk - a truly parallel extension of C
- Emu code generator coupled to the LLVM front end
- inlining of many library functions to leverage Emu instructions
- added capability to define memory views and place data in those views
- shared data items are dynamic shared objects (DSOs) that may be defined outside the programs that manipulate them and persist beyond program completion
- private automatic variables are declared normally in Cilk.
Emu Cilk extends C with a few new keywords (a short sketch of how they compose follows this list):
- cilk_spawn [<var> =] <function>
  creates a new “child” thread running <function> while the parent thread continues asynchronously
- cilk_sync
  causes the current function to wait until all its children have completed
- cilk_for (<iterator>) {<iteration_code>} [grainsize = <grainsize>]
  creates a group of new threads (from CilkPlus)
- termination of a function always performs an implicit sync
- no support (at present) for INLET, CilkPlus vector operations, or C++ (the plan is to eventually add C++ functionality)
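For readers new to Cilk, here is a minimal sketch of how the three keywords compose. It is written in ordinary Cilk Plus syntax with the <cilk/cilk.h> keyword macros; treat the header and build flags as assumptions about the Emu toolchain, not as its documented usage.

  #include <stdio.h>
  #include <cilk/cilk.h>   /* cilk_spawn / cilk_sync / cilk_for keyword macros */

  /* Recursive Fibonacci: each level spawns a child thread for fib(n-1);
     on Emu a spawned child could begin life on any NodeLet. */
  long fib(int n)
  {
      if (n < 2)
          return n;
      long x = cilk_spawn fib(n - 1);  /* child runs while the parent continues */
      long y = fib(n - 2);
      cilk_sync;                       /* wait for all children of this function */
      return x + y;
  }

  int main(void)
  {
      long a[1000];

      /* cilk_for creates a group of worker threads over the iteration space */
      cilk_for (long i = 0; i < 1000; i++)
          a[i] = i * i;

      printf("fib(20) = %ld, a[999] = %ld\n", fib(20), a[999]);
      return 0;  /* termination of a function performs an implicit sync */
  }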
System Software:
- Linux runs on the Stationary Cores (SCs).
- The OS launches the main() user program on a Gossamer Core (GC).
- main() spawns descendants that execute in parallel and migrate throughout the system as needed.
- The runtime executes primarily on the SCs and handles service requests from the threads running on the GCs:
  o memory allocation and release, I/O, exception handling, and performance monitoring.
  o a few special system threads run on Gossamer Cores to provide real-time system management, such as distribution of credits.
- Threads return to main() upon completion, which then returns to the OS.
References
http://www.ipdps.org/ipdps2018/EmuTechTutorial-IPDPS2018.pdf
http://www.emutechnology.com/wp-content/uploads/2017/11/Emu1-Architecture.pdf
https://www.nas.nasa.gov/assets/pdf/ams/2016/AMS_20161215_Jacobsen.pdf
An Initial Characterization of the Emu Chick - http://hpcgarage.org/ipdps18/slides--emu.pdf
Implementing Radix Sort on Emu 1 - http://www.cs.utah.edu/~rajeev/minutoli15.pdf
Gossamer - https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Roback.pdf
Literature:
https://www.nextplatform.com/2016/02/26/innovative-memory-server-knocks-analytics-systems-into-balance
https://www.nextplatform.com/2016/02/23/multicore-pioneer-tracks-architectural-path-to-exascale
P.S.: ESPRIT (Transputer) Super Node machines were built on comparable principles 30 years ago …
https://www.computer.org/csdl/proceedings/hicss/1989/1911/01/00047178.pdf
__________________________________________________________
Mit freundlichen Grüßen / with best regards / 此致敬礼
Uwe Mielke
Customer Service & Projects | Design for Manufacturing
Infineon Technologies Dresden GmbH
Koenigsbruecker Strasse 180, D-01099 Dresden
phone: +49 (351) 886.2923 | mobile: +49 (176) 6220.4565