Hi all,
Info: here is an alternative computing machine (www.emutechnology.com) which uses a massively distributed memory architecture. The main difference vs. conventional high-performance computing is the approach of moving small computing thread(s) close to widely distributed data attached to simple CPUs, instead of moving all the data to one high-performance CPU. That requires massive, fine-grained task parallelism and a message-passing network for thread distribution and synchronization, but it may save a lot of energy.
The 1st (demo) generation Emu (Chick) was built on the Altera Arria-10 FPGA:
- One Node = 8 NodeLets
- One NodeLet = 1 “Gossamer” core running @ 300 MHz
- each NodeLet has one 8-bit (Narrow Channel) DDR-SDRAM controller connected to one 2GB DDR4-SDRAM
  (note: DDR4 uses a burst length of 8 (8n prefetch), i.e. one access delivers 8 bytes serially)
- per Node one Migration Engine crossbar w/ 6 communication ports (3D);
  each communication port is a 4-lane Serial Rapid-IO Gen2.0 @ 2.5 Gbps in each direction
- per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets
Scaling (Emu Chick):
- up to 8 Node Boards = 64 Gossamer cores
The 2nd generation Emu-1 (Rack) is based on a custom ASIC containing the “Gossamer” cores & migration engine:
- One Node = 8 NodeLets
- One NodeLet = 4 “Gossamer” cores running @ ??? MHz
- each NodeLet shares one 8-bit (Narrow Channel) DDR-SDRAM controller connected to one 8GB DDR4-2133-SDRAM
- per Node one Migration Engine crossbar w/ 6 ports (3D);
  each communication port is a 4-lane Serial Rapid-IO Gen2.3 @ 6.25 Gbps in each direction
- per Node one stationary core: PowerPC e5500 - for user interface, code & data & result transfer to & from NodeLets
Scaling (limited by Rapid-IO ports):
- up to 64K ports => 64K Nodes (max. 2,097,152 Gossamer cores)
- Cube Topology with Face Diagonals, 120 GB/s bisection bandwidth
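(The maximum core count follows directly from the per-node figures above: 64K Nodes × 8 NodeLets/Node × 4 Gossamer cores/NodeLet = 2,097,152 cores.)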
Emu … key elements:
“Gossamer” Core (GC) – up to 8 per Node
- 64-way multithreaded (64bit?) CPU … to hide memory latency
- Accumulator + 16 general registers, small instruction cache, no data caches!
- one FMA FPU per core
- instruction set (see the small stand-in example after this list):
  o rich suite of Atomic Memory Operations
  o SPAWN instruction to create a new thread (on any NodeLet)
  o RELEASE instruction to place context in a Service Queue for processing by SC
- thread scheduling is automatic and performed by hardware
- thread migration to other NodeLets via HW queues (no software involvement)
- GC will perform local computations and memory references
- GC will call system services on Stationary Cores
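To make the “atomic memory operations instead of data caches” point concrete, here is a tiny stand-in sketch in plain C11 (this is NOT the Emu ISA or its intrinsics, which are not spelled out here): a histogram where every update is a single atomic add at the element's home location, so the work goes to the data rather than the data being pulled through a cache hierarchy.

  #include <stdatomic.h>
  #include <stdio.h>

  /* On Emu the bins would be spread across NodeLets; a thread touching a
     remote bin would migrate there and update it in place. C11 atomics
     are used here only as a portable stand-in for the hardware AMOs. */
  atomic_long bins[256];

  void count_byte(unsigned char b)
  {
      atomic_fetch_add_explicit(&bins[b], 1, memory_order_relaxed);
  }

  int main(void)
  {
      const unsigned char sample[] = "gossamer";
      for (int i = 0; sample[i] != '\0'; i++)
          count_byte(sample[i]);
      printf("bins['s'] = %ld\n", (long)atomic_load(&bins['s']));
      return 0;
  }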
Stationary Core (SC) – one per Node
- 64bit PowerPC e5500
- runs Linux (CentOS)
- SC manages the file system (SATA interface), 1TB SSD per node
- SC manages I/O (PCIe interface - Ethernet)
- SC will initialize and close Gossamer threads (code, data-in, result-out)
Migration Engine – one per Node
- crossbar w/ 6 communication ports
- each communication port is a 4-lane Serial Rapid-IO connection
- up to 64K ports = 64K Nodes (up to 8 NodeLets/Node)
Hardware
1) Node Board = 1 Node (8 NodeLets)
2) 32-Node Motherboard
- will provide Rapid-IO network interconnects for all nodes and to other motherboards
3) one 19” Rack 3U-Tray = one 32-Node Motherboard (1024 Gossamer cores)
- will provide further Rapid-IO network interconnects …
4) one Rack = 8 Trays (8192 Gossamer cores)
- will provide further Rapid-IO network interconnects …
Scaling up to 256 Racks (2,097,152 Gossamer cores), 4PB DDR4 RAM, 128PB SSD, ~10MW.
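(Working the numbers from the items above: 256 racks × 8 trays × 32 nodes = 65,536 = 64K Nodes; × 32 Gossamer cores/Node = 2,097,152 cores; × 64 GB DDR4/Node (8 NodeLets × 8 GB) = 4 PB DDR4 RAM.)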
Software
LLVM compiler for Cilk - a truly parallel extension of C
- Emu code generator coupled to the LLVM front end
- inlining of many library functions to leverage Emu instructions
- added capability to define memory views and place data in those views
- shared data items are dynamic shared objects (DSOs) that may be defined outside the programs that manipulate them and persist beyond program completion
- private automatic variables are declared normally in Cilk.
Emu Cilk extends C with a few new keywords (a short sketch of how they compose follows this list):
- cilk_spawn [<var> =] <function>
  creates a new “child” thread running <function> while the parent thread continues asynchronously
- cilk_sync
  causes the current function to wait until all its children have completed
- cilk_for (<iterator>) {<iteration_code>} [grainsize = <grainsize>]
  creates a group of new threads (from CilkPlus)
- termination of a function always performs an implicit sync
- no support (at present) for INLET, CilkPlus vector operations, or C++ (the plan is to eventually add C++ functionality)
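For readers new to Cilk, here is a minimal sketch of how the three keywords compose. It is written in ordinary Cilk Plus syntax with the <cilk/cilk.h> keyword macros; treat the header and build flags as assumptions about the Emu toolchain, not as its documented usage.

  #include <stdio.h>
  #include <cilk/cilk.h>   /* cilk_spawn / cilk_sync / cilk_for keyword macros */

  /* Recursive Fibonacci: each level spawns a child thread for fib(n-1);
     on Emu a spawned child could begin life on any NodeLet. */
  long fib(int n)
  {
      if (n < 2)
          return n;
      long x = cilk_spawn fib(n - 1);  /* child runs while the parent continues */
      long y = fib(n - 2);
      cilk_sync;                       /* wait for all children of this function */
      return x + y;
  }

  int main(void)
  {
      long a[1000];

      /* cilk_for creates a group of worker threads over the iteration space */
      cilk_for (long i = 0; i < 1000; i++)
          a[i] = i * i;

      printf("fib(20) = %ld, a[999] = %ld\n", fib(20), a[999]);
      return 0;  /* termination of a function performs an implicit sync */
  }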
System Software:
- Linux runs on the Stationary Cores (SCs).
- The OS launches the main() user program on a Gossamer Core (GC).
- main() spawns descendants that execute in parallel and migrate throughout the system as needed.
- The runtime executes primarily on the SCs and handles service requests from the threads running on the GCs:
  o memory allocation and release, I/O, exception handling, and performance monitoring.
  o a few special system threads run on Gossamer Cores to provide real-time system management, such as distribution of credits.
- Threads return to main() upon completion, which then returns to the OS.
References
http://www.ipdps.org/ipdps2018/EmuTechTutorial-IPDPS2018.pdf
http://www.emutechnology.com/wp-content/uploads/2017/11/Emu1-Architecture.pdf
https://www.nas.nasa.gov/assets/pdf/ams/2016/AMS_20161215_Jacobsen.pdf
An Initial Characterization of the Emu Chick - http://hpcgarage.org/ipdps18/slides--emu.pdf
Implementing Radix Sort on Emu 1 - http://www.cs.utah.edu/~rajeev/minutoli15.pdf
Gossamer - https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Roback.pdf
Literature:
https://www.nextplatform.com/2016/02/26/innovative-memory-server-knocks-analytics-systems-into-balance
https://www.nextplatform.com/2016/02/23/multicore-pioneer-tracks-architectural-path-to-exascale
P.S.: ESPRIT (Transputer) Super Node machines were built on comparable principles 30 years ago …
https://www.computer.org/csdl/proceedings/hicss/1989/1911/01/00047178.pdf
__________________________________________________________
Mit freundlichen Grüßen / with best regards / 此致敬礼
Uwe Mielke
Customer Service & Projects | Design for Manufacturing
Infineon Technologies Dresden GmbH
Koenigsbruecker Strasse 180, D-01099 Dresden
phone: +49 (351) 886.2923 | mobile: +49 (176) 6220.4565