
Re: Emu ... (distributed) memory centric computing network

Hello Bernard,


Please note that I am not affiliated with Emu Technology. Your congratulations should go to Peter Kogge et al. for their work on Emu.


I discovered the Emu details on the web and thought it worth sharing with the occam community … mainly because of some similarities with the ESPRIT Transputer SuperNode machines.


The Emu hardware may be efficient for any kind of sparse matrix problem, as long as the communication effort stays low. It would be worth starting your own investigations …


Another comparable massively parallel hardware platform from Europe – resulting from the EU Human Brain Project – is called SpiNNaker. Here you may find some more information. The system may be useful for sparse matrix problems as well.

SpiNNaker2 – Towards extremely efficient digital neuromorphics and multi-scale brain emulation
(watch presentation)
(download pdf)


Best regards,


From: pottier [mailto:bernard.pottier@xxxxxxxxxxxxx]
Sent: Sunday, 23 September 2018 11:22
To: Uwe Mielke
Cc: Truong Phong Tuyen 001702; Bernard Pottier
Subject: Re: Emu ... (distributed) memory centric computing network



Hello Uwe,


Congratulations on this great job! I am curious to learn more about questions such as data feeding and node programming capabilities.


We are investigating fine-grained, massively parallel applications, using a high-level (OO) front end to generate code for Occam, GPUs, and, more recently, MPI. The applications are simulators for observing the physical environment. Do you think they could be a good match for your Emu?


Best regards,


Bernard Pottier

Professor, CS,

University of Brest/LabSTICC



On 22/09/2018 at 22:10, Uwe Mielke wrote:

Hi all,


Info: here is an alternative computing machine (www.emutechnology.com) which uses a massively distributed memory architecture. The main difference from current high performance computing is the approach of moving small computing thread(s) close to widely distributed data attached to simple CPUs, instead of moving all the data to a high performance CPU. That requires massive, fine-grained task parallelism and a message passing network for thread distribution and synchronization, but may save a lot of energy.
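To make the "move the thread, not the data" idea a bit more concrete, here is a minimal toy sketch. It is not Emu's real addressing scheme: the round-robin striping, the function names and the address ranges are all assumptions chosen only for illustration. It counts how often a thread would have to hop between nodelets to follow the data it reads.

```python
# Toy model (NOT Emu's real addressing): words are striped round-robin
# across nodelets, and a thread must run on the nodelet that owns the
# word it is currently reading.

def owner(address, num_nodelets):
    """Nodelet that holds a given word under round-robin striping."""
    return address % num_nodelets

def count_migrations(addresses, num_nodelets, start_nodelet=0):
    """Count how often a thread has to hop to follow its data."""
    current = start_nodelet
    migrations = 0
    for address in addresses:
        home = owner(address, num_nodelets)
        if home != current:
            migrations += 1   # the thread context moves, the data stays put
            current = home
    return migrations

# A stride equal to the nodelet count stays on one nodelet ...
local_walk = count_migrations(range(0, 64, 8), num_nodelets=8)
# ... while a unit stride hops on every access after the first.
striped_walk = count_migrations(range(0, 8), num_nodelets=8)
print(local_walk, striped_walk)  # 0 7
```

In this toy model the cost of an access pattern is the number of hops, which is why data placement matters so much for such a machine.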


The 1st (demo) generation Emu (Chick) was built on Altera Arria 10 FPGAs.

-         One Node = 8 NodeLets

-         One NodeLet = 1 “Gossamer” core running @ 300MHz

-         each NodeLet has one 8bit (Narrow Channel) DDR-SDRAM controller connected to one 2GB DDR4-SDRAM

note: DDR4 uses a burst length of 8 (8n prefetch), i.e. one access delivers 8 bytes serially over the 8-bit channel

-         per Node one Migration Engine crossbar w/ 6 communication ports (3D)

each communication port is a 4-lane Serial Rapid-IO Gen2.0 @ 2.5 Gbps in each direction

-         per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets

Scaling (Emu Chick):

-         up to 8 Node Boards = 64 Gossamer cores
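As a quick cross-check of the Chick figures above, the totals can be worked out as follows. The constants are taken from the list; the variable names are only illustrative.

```python
# Figures taken from the Emu Chick list above.
NODELETS_PER_NODE = 8
CORES_PER_NODELET = 1
DRAM_PER_NODELET_GB = 2
MAX_NODES = 8

CHANNEL_WIDTH_BITS = 8   # narrow-channel DDR4 controller
BURST_LENGTH = 8         # DDR4 burst length (transfers per access)

cores = MAX_NODES * NODELETS_PER_NODE * CORES_PER_NODELET
dram_gb = MAX_NODES * NODELETS_PER_NODE * DRAM_PER_NODELET_GB
bytes_per_burst = CHANNEL_WIDTH_BITS * BURST_LENGTH // 8

print(cores, dram_gb, bytes_per_burst)  # 64 128 8
```

So a fully populated Chick has 64 Gossamer cores and 128 GB of DRAM, and each memory access returns 8 bytes, i.e. one 64-bit word.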


The 2nd generation Emu-1 (Rack) is based on a custom ASIC containing “Gossamer” cores & migration engine.

-         One Node = 8 NodeLets

-         One NodeLet = 4 “Gossamer” cores running @ ??? MHz

-         each NodeLet shares one 8bit (Narrow Channel) DDR-SDRAM controller connected to one 8GB DDR4-2133-SDRAM

-         per Node one Migration Engine crossbar w/ 6 ports (3D)

each communication port is a 4-lane Serial Rapid-IO Gen2.3 @ 6.25 Gbps in each direction

-         per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets

Scaling (limited by Rapid-IO ports):

-         up to 64K ports => 64K Nodes (max. 2,097,152 Gossamer cores)

-         Cube Topology with Face Diagonals 120 GB/s bisection bandwidth



Emu … key elements:


“Gossamer” Core (GC) – 8 per Node on the Chick, 32 per Node on Emu-1

-         64-way multithreaded (64-bit?) CPU … to hide memory latency

-         Accumulator + 16 general registers, small instruction cache, no data caches!

-         one FMA FPU per core

-         instruction set:

o      rich suite of Atomic Memory Operations

o      SPAWN  instruction to create a new thread (on any NodeLet)

o      RELEASE  instruction to place context in a Service Queue for processing by SC

-         thread scheduling is automatic and performed by hardware

-         thread migration to other NodeLets via HW queues (no software involvement)

-         GC will perform local computations and memory references

-         GC will call system services on Stationary Cores
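Why heavy multithreading can hide memory latency without data caches can be seen with a simple textbook-style model. This is only a sketch under strong assumptions: every thread alternates a fixed number of compute cycles with one memory reference of fixed latency, and the core switches threads at no cost. The cycle counts below are hypothetical and are not Emu measurements.

```python
import math

def core_utilization(threads, compute_cycles, memory_latency_cycles):
    """Fraction of cycles the core does useful work (idealized model)."""
    busy = threads * compute_cycles
    period = compute_cycles + memory_latency_cycles
    return min(1.0, busy / period)

def threads_to_saturate(compute_cycles, memory_latency_cycles):
    """Smallest thread count that keeps the core busy all the time."""
    return math.ceil((compute_cycles + memory_latency_cycles) / compute_cycles)

# Hypothetical numbers: 4 compute cycles per 100-cycle memory reference.
single = core_utilization(1, 4, 100)
full = core_utilization(64, 4, 100)
needed = threads_to_saturate(4, 100)
print(round(single, 3), full, needed)  # 0.038 1.0 26
```

Under these assumed numbers a single thread keeps the core busy less than 4% of the time, while 64 thread contexts are more than enough to cover the latency.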

Stationary Core (SC) – one per Node

-         64bit PowerPC e5500

-         runs Linux (CentOS)

-         SC manages file system (SATA interface) 1TB SSD per node

-         SC manages I/O (PCIe interface - Ethernet)

-         SC will initialize and close Gossamer threads (code, data-in, result-out)

Migration Engine – one per Node

-         crossbar w/ 6 communication ports

-         each communication port is a 4-lane Serial Rapid-IO connection

-         up to 64K ports = 64K Nodes (up to 8 NodeLets/Node)





1) Node Board = 1 Node (8 NodeLets)

2) 32-Node Motherboard

-         will provide Rapid-IO network interconnects for all nodes and to other motherboards

3) one 19” Rack 3U-Tray = one 32-Node Motherboard (1024 Gossamer cores)

-         will provide further Rapid-IO network interconnects …

4) one Rack = 8 Trays (8192 Gossamer cores)

-         will provide further Rapid-IO network interconnects …

Scaling up to 256 Racks (2,097,152 Gossamer cores), 4PB DDR4 RAM, 128PB SSD, ~10MW.
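The core and DRAM totals for the packaging hierarchy above can be checked step by step. The per-nodelet figures come from the Emu-1 list earlier in this mail; binary units (1 PB = 1024 * 1024 GB) are an assumption here, and the variable names are only illustrative.

```python
# Emu-1 figures from the lists above.
NODELETS_PER_NODE = 8
CORES_PER_NODELET = 4
DRAM_PER_NODELET_GB = 8
NODES_PER_TRAY = 32
TRAYS_PER_RACK = 8
MAX_RACKS = 256

cores_per_node = NODELETS_PER_NODE * CORES_PER_NODELET
cores_per_tray = NODES_PER_TRAY * cores_per_node
cores_per_rack = TRAYS_PER_RACK * cores_per_tray

total_nodes = MAX_RACKS * TRAYS_PER_RACK * NODES_PER_TRAY
total_cores = total_nodes * cores_per_node
# Assumption: binary units, 1 PB = 1024 * 1024 GB.
total_dram_pb = total_nodes * NODELETS_PER_NODE * DRAM_PER_NODELET_GB / 1024**2

print(cores_per_tray, cores_per_rack, total_nodes, total_cores, total_dram_pb)
# 1024 8192 65536 2097152 4.0
```

The 65,536 nodes match the 64K Rapid-IO port limit mentioned under scaling.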





LLVM compiler for Cilk - a truly parallel extension of C

-         Emu code generator coupled to LLVM front end

-         Inlining  of many library functions to leverage Emu instructions

-         added capability to define memory views and place data in those views

-         shared data items are dynamic shared objects (DSOs) that may be defined outside the programs that manipulate them and persist beyond program completion

-         private automatic variables are declared normally in Cilk.

Emu Cilk extends C with a few new keywords:

-         cilk_spawn  [<var> =] <function>

creates new “child” thread running <function> while parent thread continues asynchronously

-         cilk_sync

causes current function to wait until all its children have completed

-         cilk_for  (<iterator>) {<iteration_code>} [grainsize = <grainsize>]

creates a group of new threads (from Cilk Plus)

-         Termination of a function always performs an implicit sync

-         No support (at present) for inlet, Cilk Plus vector operations, or C++ (C++ functionality is planned eventually)
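The spawn / sync / grainsize pattern described above can be imitated in plain Python. This is only an analogy and not Emu Cilk: it uses the standard concurrent.futures thread pool, nothing migrates between nodelets, and the function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(values):
    """Work done by one child."""
    return sum(values)

def parallel_sum(values, grainsize):
    """Split the loop into chunks of `grainsize` iterations, one child each."""
    chunks = [values[i:i + grainsize] for i in range(0, len(values), grainsize)]
    with ThreadPoolExecutor() as pool:
        # "spawn": submit() starts a child while the parent keeps going
        children = [pool.submit(chunk_sum, chunk) for chunk in chunks]
        # "sync": result() blocks until that child has completed
        return sum(child.result() for child in children)

total = parallel_sum(list(range(1000)), grainsize=100)
print(total)  # 499500
```

A larger grainsize means fewer, bigger children; a smaller one exposes more parallelism at the price of more spawn overhead.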

System Software:

-         Linux runs on the Stationary Cores (SCs).

-         OS launches main() user program on a Gossamer Core (GC)

-         main() spawns descendants that execute in parallel and migrate throughout system as needed.

-         Runtime executes primarily on the SCs and handles service requests from the threads running on the GCs.  

o      Memory allocation and release, I/O, exception handling, and performance monitoring.

o      A few special system threads run on Gossamer Cores to provide real-time system management, such as distribution of credits

-         Threads return to main() upon completion, which then returns to the OS.








An Initial Characterization of the Emu Chick - http://hpcgarage.org/ipdps18/slides--emu.pdf

Implementing Radix Sort on Emu 1 - http://www.cs.utah.edu/~rajeev/minutoli15.pdf

Gossamer - https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Roback.pdf







P.S.: ESPRIT (Transputer) Super Node machines were built on comparable principles 30 years ago …




Mit freundlichen Grüßen / with best regards / 此致敬礼


Uwe Mielke

Customer Service & Projects

Design for Manufacturing


Infineon Technologies Dresden GmbH


Koenigsbruecker Strasse 180

D-01099 Dresden

phone:+49 (351) 886.2923

mobile:+49 (176) 6220.4565