Hi all,

Info: here is an alternative computing machine (www.emutechnology.com) which uses a
massively distributed memory architecture. The main difference from current high
performance computing is the approach of moving small computing thread(s) close to
widely distributed data attached to simple CPUs, instead of moving all
the data to a high performance CPU. That requires massive, fine-grained task
parallelism and a message-passing network for thread distribution and
synchronization, but may save a lot of energy.

The 1st (demo) generation Emu (Chick) was built on an Altera Arria-10 FPGA:
- One Node = 8 NodeLets
- One NodeLet = 1 “Gossamer” core running @ 300 MHz
- each NodeLet has one 8-bit (Narrow Channel) DDR-SDRAM controller connected to
  one 2 GB DDR4-SDRAM
  note: DDR4 uses a burst length of 8, i.e. one access delivers 8 bytes
  serially over the 8-bit channel
- per Node one Migration Engine crossbar w/ 6 communication ports (3D)
  each communication port is a 4-lane Serial Rapid-IO Gen2.0 link @ 2.5 Gbps
  in each direction
- per Node one stationary core: PowerPC e5500, for user interface and for
  code, data & result transfer to & from NodeLets

Scaling (Emu Chick):
- up to 8 Node Boards = 64 Gossamer cores

The 2nd generation Emu-1 (Rack) is based on a custom ASIC containing
“Gossamer” cores & migration engine:
- One Node = 8 NodeLets
- One NodeLet = 4 “Gossamer” cores running @ ??? MHz
- each NodeLet shares one 8-bit (Narrow Channel) DDR-SDRAM controller connected
  to one 8 GB DDR4-2133-SDRAM
- per Node one Migration Engine crossbar w/ 6 ports (3D)
  each communication port is a 4-lane Serial Rapid-IO Gen2.3 link @ 6.25 Gbps
  in each direction
- per Node one stationary core: PowerPC e5500, for user interface and for
  code, data & result transfer to & from NodeLets

Scaling (limited by Rapid-IO ports):
- up to 64K ports => 64K Nodes (max. 2,097,152 Gossamer cores)
- Cube Topology with Face Diagonals, 120 GB/s bisection bandwidth

Emu key elements:

“Gossamer” Core (GC): up to 8 per Node
- 64-way multithreaded (64-bit?) CPU, to hide memory latency
- Accumulator + 16 general registers, small instruction cache, no data caches!
- one FMA FPU per core
- instruction set:
  o rich suite of Atomic Memory Operations
  o SPAWN instruction to create a new thread (on any NodeLet)
  o RELEASE instruction to place context in a Service Queue for processing
    by SC
- thread scheduling is automatic and performed by hardware
- thread migration to other NodeLets via HW queues (no software involvement)
- GC will perform local computations and memory references
- GC will call system services on Stationary Cores

Stationary Core (SC): one per Node
- 64-bit PowerPC e5500
- runs Linux (CentOS)
- SC manages file system (SATA interface), 1 TB SSD per node
- SC manages I/O (PCIe interface - Ethernet)
- SC will initialize and close Gossamer threads (code, data-in, result-out)

Migration Engine: one per Node
- crossbar w/ 6 communication ports
- each communication port is a 4-lane Serial Rapid-IO connection
- up to 64K ports = 64K Nodes (up to 8 NodeLets/Node)

Hardware

1) Node Board = 1 Node (8 NodeLets)
2) 32-Node Motherboard
   - will provide Rapid-IO network interconnects for all nodes and to other
     motherboards
3) one 19” Rack 3U-Tray = one 32-Node Motherboard (1024 Gossamer cores)
   - will provide further Rapid-IO network interconnects …
4) one Rack = 8 Trays (8192 Gossamer cores)
   - will provide further Rapid-IO network interconnects …

Scaling up to 256 Racks (2,097,152 Gossamer cores), 4 PB DDR4 RAM, 128 PB SSD,
~10 MW.

Software

LLVM compiler for Cilk - a truly parallel extension of C
- Emu code generator coupled to LLVM front end
- Inlining of many library functions to leverage Emu instructions
- added capability to define memory views and place data in those views
- shared data items are dynamic shared objects (DSOs) that may be defined
  outside the programs that manipulate them and persist beyond program
  completion
- private automatic variables are declared normally in Cilk.

Emu Cilk extends C with a few new keywords:
- cilk_spawn [<var> =] <function>
  creates a new “child” thread running <function> while the parent thread
  continues asynchronously
- cilk_sync
  causes the current function to wait until all its children have completed
- cilk_for (<iterator>) {<iteration_code>} [grainsize = <grainsize>]
  creates a group of new threads (from CilkPlus)
- Termination of a function always performs an implicit sync
- No support (at present) for INLET, CilkPlus vector operations, or C++
  (C++ functionality is planned eventually)

System Software:
- Linux runs on the Stationary Cores (SCs).
- OS launches main() user program on a Gossamer Core (GC)
- main() spawns descendants that execute in parallel and migrate throughout
  the system as needed.
- Runtime executes primarily on the SCs and handles service requests from the
  threads running on the GCs.
  o Memory allocation and release, I/O, exception handling, and performance
    monitoring.
  o A few special system threads run on Gossamer Cores to provide real-time
    system management, such as distribution of credits
- Threads return to main() upon completion, which then returns to the OS.

References

http://www.ipdps.org/ipdps2018/EmuTechTutorial-IPDPS2018.pdf
http://www.emutechnology.com/wp-content/uploads/2017/11/Emu1-Architecture.pdf
https://www.nas.nasa.gov/assets/pdf/ams/2016/AMS_20161215_Jacobsen.pdf
An Initial Characterization of the Emu Chick - http://hpcgarage.org/ipdps18/slides--emu.pdf
Implementing Radix Sort on Emu 1 - http://www.cs.utah.edu/~rajeev/minutoli15.pdf
Gossamer - https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Roback.pdf

Literature: https://www.nextplatform.com/2016/02/23/multicore-pioneer-tracks-architectural-path-to-exascale

P.S.: ESPRIT (Transputer) Super Node machines were built on comparable
principles 30 years ago … https://www.computer.org/csdl/proceedings/hicss/1989/1911/01/00047178.pdf
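
As a cross-check of the Emu-1 scaling figures quoted in the Hardware section above, here is a minimal Python sketch that multiplies out the hierarchy (cores per NodeLet, NodeLets per Node, Nodes per tray, trays per rack, racks). The per-level counts are the ones listed in this mail; the constant and function names are made up for illustration, and treating 1 PB as 1024^2 GB is an assumption.

```python
# Per-level counts as listed in this mail for Emu-1 (names are illustrative).
CORES_PER_NODELET = 4
NODELETS_PER_NODE = 8
NODES_PER_TRAY = 32        # one 32-Node Motherboard per 3U tray
TRAYS_PER_RACK = 8
MAX_RACKS = 256
GB_PER_NODELET = 8         # one 8 GB DDR4 device per NodeLet


def emu1_totals(racks=MAX_RACKS):
    """Return (nodes, cores, memory_gb) for a system of `racks` racks."""
    nodes = racks * TRAYS_PER_RACK * NODES_PER_TRAY
    nodelets = nodes * NODELETS_PER_NODE
    cores = nodelets * CORES_PER_NODELET
    memory_gb = nodelets * GB_PER_NODELET
    return nodes, cores, memory_gb


nodes, cores, memory_gb = emu1_totals()
# Assumes binary prefixes: 1 PB = 1024 * 1024 GB.
print(f"nodes={nodes} cores={cores} memory={memory_gb / 1024**2:g} PB")
```

With the defaults this gives 65536 nodes (the 64K port limit), 2,097,152 cores and 4 PB of DRAM, in line with the figures above; emu1_totals(1) gives the 8192 cores of a single rack.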
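
To illustrate the spawn/sync semantics described above for Emu Cilk, here is a rough analogy in plain Python using the standard concurrent.futures module. This is not Cilk and does not run on Emu hardware: pool.submit() only stands in for cilk_spawn, waiting on the futures stands in for cilk_sync, and the function names and the grainsize default are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(values):
    """Work done by one 'child' thread."""
    return sum(values)


def parallel_sum(values, grainsize=1000):
    """Fork-join sum: spawn one child per chunk, then wait for all of them."""
    chunks = [values[i:i + grainsize] for i in range(0, len(values), grainsize)]
    with ThreadPoolExecutor() as pool:
        # Analogue of cilk_spawn: the parent keeps going after each submit().
        children = [pool.submit(chunk_sum, chunk) for chunk in chunks]
        # Analogue of cilk_sync: block until every child has completed.
        partials = [child.result() for child in children]
    return sum(partials)


total = parallel_sum(list(range(10_000)))
print(total)
```

On Emu the spawned threads would additionally migrate to the NodeLet holding the data they touch; nothing in this sketch models that.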
__________________________________________________________

With best regards,

Uwe Mielke
Customer Service & Projects, Design for Manufacturing
Infineon Technologies, IFD OP FE T TD ICDS CDS
Koenigsbruecker Strasse 180, D-01099 Dresden
phone: +49 (351) 886.2923
mobile: +49 (176) 6220.4565