
Re: AW: Emu ... (distributed) memory centric computing network

Not bad...

5W gives you:

 - 1 gossamer core,

 - 420KB ram

 - 14MB SSD
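
For reference, the per-core share can be worked out by dividing the totals quoted further down in this thread (2,097,152 Gossamer cores, 4PB RAM, 128PB SSD, ~10MW) by the core count. This is only a sketch: it assumes the PB figures are binary (PiB) and that the power is spread evenly over the cores; the function name is made up for illustration. Under those assumptions the power share is close to 5W, while the RAM and SSD shares come out in gigabytes, so the RAM and SSD figures above may rest on a different normalisation.

```python
# Totals quoted for the 256-rack configuration.
CORES = 2_097_152          # Gossamer cores
POWER_W = 10e6             # ~10 MW
RAM_BYTES = 4 * 2**50      # 4 PB, read as PiB (assumption)
SSD_BYTES = 128 * 2**50    # 128 PB, read as PiB (assumption)


def per_core_share():
    """Return (watts, RAM in GiB, SSD in GiB) per Gossamer core."""
    watts = POWER_W / CORES
    ram_gib = RAM_BYTES / CORES / 2**30
    ssd_gib = SSD_BYTES / CORES / 2**30
    return watts, ram_gib, ssd_gib


watts, ram_gib, ssd_gib = per_core_share()
print(f"{watts:.2f} W, {ram_gib:.0f} GiB RAM, {ssd_gib:.0f} GiB SSD per core")
# prints: 4.77 W, 2 GiB RAM, 64 GiB SSD per core
```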


On 26/09/2018 06:39, Uwe Mielke wrote:

Hello Larry,


There are figures on power consumption (on the Emu website), but only for a 256-rack system:

Scaling up to 256 Racks (2,097,152 Gossamer cores), 4PB DDR4 RAM, 128PB SSD, ~10MW.





From: Larry Dickson [mailto:tjoccam@xxxxxxxxxxx]
Sent: Monday, 24 September 2018 05:17
To: Bernard Pottier
Cc: uwe.mielke@xxxxxxxxxxx; occam-com@xxxxxxxxxx; Tuyen Phong Truong
Subject: Re: Emu ... (distributed) memory centric computing network




I noticed the 300 MHz NodeLet in the original spec, but it is followed by question marks for CPU speed in the second version.


Is any use being made of the superlinear (greater than first-power) relation between CPU speed and power consumption? It implies that LOWERING the CPU speed can increase the power efficiency of a massively parallel system with variable core count.
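
The trade-off in this question can be sketched with a toy model. It assumes per-core power grows as k * f**alpha with alpha > 1 (alpha = 3 is a common rule of thumb when supply voltage scales with frequency), and that the aggregate throughput, measured as cores times MHz, is held constant by adding cores as the clock drops. The constants k and alpha and the throughput figure are illustrative assumptions, not Emu data, and the model ignores static (leakage) power and per-core overheads, which limit the gain in practice.

```python
def total_power(freq_mhz, throughput_mhz_cores=1_000_000.0, alpha=3.0, k=1e-7):
    """Total power in W for a fixed aggregate throughput (cores * MHz).

    Per-core power is modelled as k * f**alpha (an assumption, alpha > 1).
    """
    cores = throughput_mhz_cores / freq_mhz      # cores needed at this clock
    return cores * k * freq_mhz ** alpha         # scales as f**(alpha - 1)


for f in (150.0, 300.0, 600.0):
    print(f"{f:5.0f} MHz -> {total_power(f):8.1f} W")
```

With alpha = 3 the total power scales as f squared, so halving the clock and doubling the core count cuts power to a quarter in this model.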


Are there figures on transistor count and power consumption?


Larry Dickson


On Sep 23, 2018, at 11:41 AM, Bernard Pottier <bernard.pottier@xxxxxxxxxxxxx> wrote:

On 23/09/2018 at 16:05, Uwe Mielke wrote:

Hello Bernard,


Hello again Uwe,
and hello to the Occam list,

Please note that I’m not affiliated with Emu Technology. Your congratulations should go to Peter Kogge et al. for their work on Emu.


Sorry for the mistake, and all my congratulations to Peter and his group.
Thank you Uwe for your valuable reference to these projects.

I discovered the Emu details on the web and thought them worth sharing with the occam community, simply because of some similarities with the ESPRIT Transputer SuperNode machines.


The Emu hardware may be efficient for any kind of sparse matrix problem as long as the communication effort stays low. You should start your own investigations …


It is also interesting for its focus on data and memory. Computing with memories can be closer
to a physical process than to numerical interpretation, and it is compatible with low-level technologies,
including FPGAs.

Another comparable massively parallel hardware platform from Europe, resulting from the EU Human Brain Project, is called SpiNNaker. Here you may find some more information. The system may be useful for sparse matrix problems as well.

SpiNNaker2 – Towards extremely efficient digital neuromorphics and multi-scale brain emulation
(watch presentation)

(download pdf)


Thank you also for this reference. We are working on large-scale problems, mostly at geographic
scale at the moment, but the neural connection concept is very likely similar to other
natural situations such as water circulation.

Best regards,


Best regards,


From: pottier [mailto:bernard.pottier@xxxxxxxxxxxxx]
Sent: Sunday, 23 September 2018 11:22
To: Uwe Mielke
Cc: Truong Phong Tuyen 001702; Bernard Pottier
Subject: Re: Emu ... (distributed) memory centric computing network



Hello Uwe,


Congratulations on this great job! I am curious to discover more about questions such as data feeding and node programming capabilities.

We are investigating fine-grain, massively parallel applications, using a high-level (OO) front end to generate code for Occam, GPUs, and, more recently, MPI. The applications are simulators for observing the physical environment. Do you think they could match your Emu?


Best regards,


Bernard Pottier

Professor, CS,

University of Brest/LabSTICC



On 22/09/2018 at 22:10, Uwe Mielke wrote:

Hi all,


Info: here is an alternative computing machine (www.emutechnology.com) which uses a massively distributed memory architecture. The main difference from current high-performance computing is the approach: small computing thread(s) are moved close to widely distributed data attached to simple CPUs, instead of moving all the data to a high-performance CPU. That requires massive, fine-grained task parallelism and a message-passing network for thread distribution and synchronization, but may save a lot of energy.


The 1st (demo) generation Emu (Chick) was built on an Altera Arria 10 FPGA.

-         One Node = 8 NodeLets

-         One NodeLet = 1 “Gossamer” core running @ 300MHz

-         each NodeLet has one 8bit (Narrow Channel) DDR-SDRAM controller connected to one 2GB DDR4-SDRAM

note: DDR4 uses a burst length of 8 (8n prefetch), i.e. one access delivers 8 bytes serially

-         per Node one Migration Engine crossbar w/ 6 communication ports (3D)

each communication port is a 4-lane Serial Rapid-IO Gen2.0 @ 2.5 Gbps in each direction

-         per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets

Scaling (Emu Chick):

-         up to 8 Node Boards = 64 Gossamer cores


The 2nd generation Emu-1 (Rack) is based on a custom ASIC containing “Gossamer” cores & migration engine.

-         One Node = 8 NodeLets

-         One NodeLet = 4 “Gossamer” cores running @ ??? MHz

-         each NodeLet shares one 8bit (Narrow Channel) DDR-SDRAM controller connected to one 8GB DDR4-2133-SDRAM

-         per Node one Migration Engine crossbar w/ 6 ports (3D)

each communication port is a 4-lane Serial Rapid-IO Gen2.3 @ 6.25 Gbps in each direction

-         per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets
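
As a rough check on the link and memory figures listed above, the per-port and per-NodeLet peak rates can be sketched as follows. The sketch assumes that the quoted 2.5 and 6.25 Gbps are per-lane line rates, that the links use 8b/10b line coding (80% payload efficiency), and that the DDR4-2133 channel moves 2133 mega-transfers per second over 8 data bits. The function names are hypothetical, and protocol overhead is ignored.

```python
def link_rate_gbytes(lanes, gbaud_per_lane, encoding_efficiency=0.8):
    """Payload rate of one port in GB/s, per direction.

    encoding_efficiency=0.8 assumes 8b/10b line coding.
    """
    return lanes * gbaud_per_lane * encoding_efficiency / 8.0


def ddr_peak_gbytes(mega_transfers_per_s, bus_width_bits):
    """Peak DRAM channel rate in GB/s."""
    return mega_transfers_per_s * 1e6 * bus_width_bits / 8.0 / 1e9


chick_port = link_rate_gbytes(4, 2.5)     # Emu Chick port: about 1 GB/s
emu1_port = link_rate_gbytes(4, 6.25)     # Emu-1 port: about 2.5 GB/s
emu1_dram = ddr_peak_gbytes(2133, 8)      # 8-bit DDR4-2133: about 2.1 GB/s
print(chick_port, emu1_port, emu1_dram)
```

Under these assumptions one Emu-1 port and one NodeLet memory channel have peak rates of the same order.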

Scaling (limited by Rapid-IO ports):

-         up to 64K ports => 64K Nodes (max. 2,097,152 Gossamer cores)

-         Cube topology with face diagonals, 120 GB/s bisection bandwidth
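
The core counts for both generations follow directly from the per-Node figures. A minimal sketch, assuming GB is read as GiB and using a made-up helper name:

```python
def system_totals(nodes, nodelets_per_node, cores_per_nodelet, gib_per_nodelet):
    """Return (Gossamer cores, total RAM in GiB) for a configuration."""
    nodelets = nodes * nodelets_per_node
    return nodelets * cores_per_nodelet, nodelets * gib_per_nodelet


chick = system_totals(8, 8, 1, 2)            # Emu Chick: 8 Node boards
emu1 = system_totals(64 * 1024, 8, 4, 8)     # Emu-1: 64K Nodes
print(chick)  # (64, 128)
print(emu1)   # (2097152, 4194304)
```

4,194,304 GiB is 4 PiB, which matches the 4PB DDR4 figure quoted for the full system.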



Emu … key elements:


“Gossamer” Core (GC) – up to 8 per Node

-         64-way multithreaded (64-bit?) CPU … to hide memory latency

-         Accumulator + 16 general registers, small instruction cache, no data caches!

-         one FMA FPU per core

-         instruction set:

o      rich suite of Atomic Memory Operations

o      SPAWN  instruction to create a new thread (on any NodeLet)

o      RELEASE  instruction to place context in a Service Queue for processing by SC

-         thread scheduling is automatic and performed by hardware

-         thread migration to other NodeLets via HW queues (no software involvement)

-         GC will perform local computations and memory references

-         GC will call system services on Stationary Cores
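
Why many hardware threads can stand in for a data cache can be illustrated with a simple textbook model. It is not taken from the Emu documentation: each thread alternates a few cycles of computation with one memory access, and the core switches to another ready thread at no cost. The cycle counts in the example are invented.

```python
def core_utilization(threads, compute_cycles, memory_latency_cycles):
    """Fraction of cycles in which the core does useful work.

    Toy model: every thread repeats `compute_cycles` of work followed by one
    memory access of `memory_latency_cycles`; thread switching is free.
    """
    busy = threads * compute_cycles
    period = compute_cycles + memory_latency_cycles
    return min(1.0, busy / period)


# Hypothetical numbers: 4 cycles of work per 100-cycle memory access.
for n in (1, 8, 26, 64):
    print(n, round(core_utilization(n, 4, 100), 3))
```

In this model a single thread keeps the core about 4% busy, and 26 or more threads saturate it, so 64 thread contexts leave headroom for longer latencies.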

Stationary Core (SC) – one per Node

-         64bit PowerPC e5500

-         runs Linux (CentOS)

-         SC manages file system (SATA interface) 1TB SSD per node

-         SC manages I/O (PCIe interface - Ethernet)

-         SC will initialize and close Gossamer threads (code, data-in, result-out)

Migration Engine – one per Node

-         crossbar w/ 6 communication ports

-         each communication port is a 4-lane Serial Rapid-IO connection

-         up to 64K ports = 64K Nodes (up to 8 NodeLets/Node)





1) Node Board = 1 Node (8 NodeLets)

2) 32-Node Motherboard

-         will provide Rapid-IO network interconnects for all nodes and to other motherboards

3) one 19” Rack 3U-Tray = one 32-Node Motherboard (1024 Gossamer cores)

-         will provide further Rapid-IO network interconnects …

4) one Rack = 8 Trays (8192 Gossamer cores)

-         will provide further Rapid-IO network interconnects …

Scaling up to 256 Racks (2,097,152 Gossamer cores), 4PB DDR4 RAM, 128PB SSD, ~10MW.
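
The packaging hierarchy above can be checked with a few multiplications. The sketch assumes the Emu-1 figure of 4 Gossamer cores per NodeLet and an even split of the ~10MW over the racks; the variable names are only illustrative.

```python
CORES_PER_NODE = 8 * 4        # 8 NodeLets x 4 Gossamer cores (Emu-1)
NODES_PER_TRAY = 32           # one 32-Node motherboard per 3U tray
TRAYS_PER_RACK = 8
RACKS = 256

cores_per_tray = NODES_PER_TRAY * CORES_PER_NODE          # 1024
cores_per_rack = TRAYS_PER_RACK * cores_per_tray          # 8192
cores_total = RACKS * cores_per_rack                      # 2097152
nodes_total = RACKS * TRAYS_PER_RACK * NODES_PER_TRAY     # 65536, i.e. 64K
kw_per_rack = 10e6 / RACKS / 1e3                          # about 39 kW per rack

print(cores_per_tray, cores_per_rack, cores_total, nodes_total, kw_per_rack)
```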





LLVM compiler for Cilk - a truly parallel extension of C

-         Emu code generator coupled to LLVM front end

-         Inlining  of many library functions to leverage Emu instructions

-         added capability to define memory views and place data in those views

-         shared data items are dynamic shared objects (DSOs) that may be defined outside the programs that manipulate them and persist beyond program completion

-         private automatic variables are declared normally in Cilk.

Emu Cilk extends C with a few new keywords:

-         cilk_spawn [<var> =] <function>

creates new “child” thread running <function> while parent thread continues asynchronously

-         cilk_sync

causes current function to wait until all its children have completed

-         cilk_for (<iterator>) {<iteration_code>} [grainsize = <grainsize>]

creates a group of new threads (from CilkPlus)

-         Termination of a function always performs an implicit sync

-         No support (at present) for INLET, CilkPlus  vector operations, C++ (eventually plan to add C++ functionality)
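
The spawn/sync pattern described above can be imitated in ordinary Python, purely as an analogy. This is not Emu Cilk: starting a thread stands in for a spawn, join() stands in for a sync, and the chunk size plays the role of the grainsize. There is no thread migration here, and CPython's global interpreter lock means the sketch shows the control flow rather than any speed-up. The function name parallel_sum is made up.

```python
import threading


def parallel_sum(values, grainsize=4):
    """Sum `values` using one child thread per chunk of `grainsize` items."""
    partial = {}

    def work(lo, hi):
        partial[lo] = sum(values[lo:hi])    # each child writes its own key

    children = []
    for lo in range(0, len(values), grainsize):
        hi = min(lo + grainsize, len(values))
        t = threading.Thread(target=work, args=(lo, hi))
        t.start()                           # "spawn": parent continues at once
        children.append(t)
    for t in children:
        t.join()                            # "sync": wait for all children
    return sum(partial.values())


print(parallel_sum(list(range(100))))  # 4950
```

A larger grainsize means fewer, longer-running children, which is the same trade-off the grainsize option of the parallel loop exposes.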

System Software:

-         Linux runs on the Stationary Cores (SCs).

-         OS launches main() user program on a Gossamer Core (GC)

-         main() spawns descendants that execute in parallel and migrate throughout system as needed.

-         Runtime executes primarily on the SCs and handles service requests from the threads running on the GCs.  

o      Memory allocation and release, I/O, exception handling, and performance monitoring.

o      A few special system threads run on Gossamer Cores to provide real-time system management, such as distribution of credits

-         Threads return to main() upon completion, which then returns to the OS.








An Initial Characterization of the Emu Chick - http://hpcgarage.org/ipdps18/slides--emu.pdf

Implementing Radix Sort on Emu 1 - http://www.cs.utah.edu/~rajeev/minutoli15.pdf

Gossamer - https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Roback.pdf







P.S.: ESPRIT (Transputer) Super Node machines were built on comparable principles 30 years ago …




Mit freundlichen Grüßen / with best regards / 此致敬礼


Uwe Mielke

Customer Service & Projects

Design for Manufacturing


Infineon Technologies Dresden GmbH


Koenigsbruecker Strasse 180

D-01099 Dresden

phone:+49 (351) 886.2923

mobile:+49 (176) 6220.4565