
RE: Transistor count



Hi all

 

I recently came across this:

 

Untether Delivers At-Memory AI  

By Linley Gwennap

 

Using what it calls an at-memory architecture, Untether AI has created a highly power-efficient accelerator that can achieve a stunning two petaop/s in a PCIe card. The at-memory design interleaves more than 250,000 tiny processing elements (PEs) inside a standard SRAM array. Putting the processing next to the memory enables massive data flow into the PEs. It also greatly reduces the power required to move data to the compute units; this movement consumes more power than the computation itself in traditional architectures. The startup is already sampling its TsunAImi card, which accelerates AI inference, and expects to ship production-qualified units in 2Q21.

 

Untether emerged from stealth and introduced its new architecture at the recent Linley Fall Processor Conference. The RunAI200 chip features 511 cores that contain a total of 192MB of primary SRAM, obviating the need for external DRAM. Operating at 960MHz, the cores generate peak performance of 502 trillion operations per second (TOPS) for the 8-bit integer (INT8) operations that commonly serve in neural-network inferencing. At a relatively low 100W TDP, RunAI200 offers more than 3x better TOPS per watt than Nvidia’s new Ampere A100 GPU. The new chip also supports an “eco mode” that delivers 377 TOPS at 47W.

 

To boost performance, the TsunAImi card includes four of these chips. At 2,000 TOPS, it outperforms all other high-end AI accelerators by more than 2x. To achieve this rating, the card has a 400W TDP, similar to the A100 card. Boasting nearly 800MB of on-chip memory, it can hold large models without using any DRAM. Untether expects the four-chip design to achieve 80,000 images per second (IPS) on ResNet-50, which would be a record for a single accelerator card. Alibaba’s HanGuang 800 card is rated at 78,563 IPS. 

 

Details can be found at https://www.untether.ai/technology

 

Looks a bit like a box full of B042s on a single chip. It claims very high core-to-core communication, but I don’t know how it is done – it is a row-based ring, but no more details are given.

 

I can see why Chris cannot use all the cores – many chips today have complex multilevel caches: local caches, cluster caches, system caches. If your data fits more or less into the per-core cache, then it flies, but once that data set extends into a shared cache, the shared cache starts to become the bottleneck, and eventually it is more efficient to run fewer cores because of the shared-cache contention. Where I worked part time for many years, the effort put into caches was considerable.
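
As a back-of-the-envelope illustration (the numbers below are invented, purely to show the shape of the curve, not measurements of any real machine), you can model each core's time per step as useful work plus a shared-cache penalty that grows with the number of contending cores; the speedup rises, peaks at a "sweet spot", and then falls away:

/* Toy model of the "sweet spot": speedup vs. core count when shared-cache
   contention grows with the number of cores. All constants are made up,
   chosen only to show the shape of the curve. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double work = 1000.0;  /* useful cycles per core per step            */
    const double base = 50.0;    /* fixed shared-cache penalty                 */
    const double lin  = 0.5;     /* penalty growing linearly with core count   */
    const double quad = 0.0005;  /* thrashing/queueing term, grows as cores^2  */

    for (int n = 1; n <= 4096; n *= 2) {
        double per_core = work + base + lin * n + quad * (double)n * n;
        double speedup  = n * work / per_core;
        printf("%5d cores -> speedup %7.1f\n", n, speedup);
    }
    printf("peak near %.0f cores\n", sqrt((work + base) / quad));
    return 0;
}

With these made-up constants the peak lands at around 1,400 cores, which is at least the same order of magnitude as the sweet spot Chris describes below.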

 

For those who don’t know the story of the B042s: these were VME boards carrying a 6 x 7 array of transputers, connected by their links. I believe they were made originally for a batch of transputers that had some of the internal bonding wires misplaced, and rather than throw expensive chips away, boards were made specially for them. Any program therefore had to run in the internal memory of the transputer. There were a few programs, like the Mandelbrot set, which could run entirely in internal memory. There were others – I think there was a version of ray tracing – that used groups of three processors as parallel pipelines, with the two outer processors handling all the communication and the middle one doing the computation.

 

Transputer boards and TRAMs effectively scaled this up. In those days there was no CPU cache, but you could “place” some of your code and data into the single-cycle internal memory rather than in external memory, which cost a minimum of 3 cycles per access.
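
The nearest modern equivalent, on parts that still have a small block of fast on-chip RAM, is to place the hot data there through a linker section. A minimal sketch, assuming a hypothetical section name ".fast_ram" (the real name depends on the toolchain and linker script, e.g. ITCM/DTCM on some ARM parts):

/* Sketch: pin the hot working set of a small FIR filter into on-chip RAM
   via a linker section. ".fast_ram" is a made-up section name. */
#include <stdint.h>

#define FAST __attribute__((section(".fast_ram")))

static FAST int32_t coeff[64];    /* filter coefficients, fast-RAM resident */
static FAST int32_t window[64];   /* recent samples (delay line)            */

int32_t fir_step(int32_t sample)
{
    int64_t acc = 0;
    for (int i = 63; i > 0; i--)
        window[i] = window[i - 1];            /* shift the delay line        */
    window[0] = sample;
    for (int i = 0; i < 64; i++)
        acc += (int64_t)coeff[i] * window[i]; /* every access stays on-chip  */
    return (int32_t)(acc >> 15);
}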

 

AI achieves phenomenal performance, and is spurring the development of massively parallel chips, because it needs only a simple subset of instructions and the workload can be partitioned into parcels small enough to exploit locality in memory.

 

There are problems, such as in Risk Analysis, where the traditional method is to use Monte-Carlo simulations, and this requires enormous computing power as the number of risks grows. However, there are radical alternative approaches. At Risk Reasoning, we found that too many people have a problem with providing data input as a percentage, e.g. this risk has an 11% probability of happening. This is because there is too much uncertainty, and you are then held to that precise figure. If you instead split your risks into bands – say low, medium-low, medium, medium-high and high – most people are more comfortable making a judgement, whether or not you put percentage figures on the boundaries between them. Now you can treat it as a database problem, handle uncertainty (because a risk can span multiple contiguous bands), greatly reduce the computing required and make it much more scalable.
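
To make the contrast concrete, here is a rough sketch of the brute-force Monte-Carlo approach with invented risks and probabilities; the cost is samples times risks, which is what blows up as the number of risks grows, whereas the banded approach replaces the whole inner loop with lookups and combinations over a handful of discrete classes per risk.

/* Sketch: brute-force Monte-Carlo aggregation of independent risks.
   All figures are invented; the point is the samples * risks cost. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double prob[]   = { 0.11, 0.05, 0.30, 0.02 };     /* chance each risk fires */
    const double impact[] = { 2.0e6, 5.0e5, 1.0e5, 8.0e6 }; /* cost if it does        */
    const int nrisk = 4;
    const long samples = 1000000;

    double total = 0.0;
    srand(42);
    for (long s = 0; s < samples; s++) {
        double loss = 0.0;
        for (int r = 0; r < nrisk; r++)
            if ((double)rand() / RAND_MAX < prob[r])
                loss += impact[r];
        total += loss;
    }
    printf("expected loss ~ %.0f (from %ld x %d trials)\n",
           total / samples, samples, nrisk);
    return 0;
}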

 

It seems to me that we still have many of the same problems we had in the 1980s – trying to find algorithms that are parallel and that map onto what we can build in hardware – and it is still the fundamental balance between CPU, memory and communications (bandwidth AND latency).

 

A big push into HPC actually originated with a project I helped create and managed a decade ago. It was successful in getting funding to develop server chips because we targeted an emerging market where many people wanted to run small programs, rather than one person running a big program (https://cordis.europa.eu/project/id/247779). That project kicked off ARM’s foray into cloud and then HPC.

 

 

Tony Gore

 

Aspen Enterprises Limited email  tony@xxxxxxxxxxxx

tel +44-1278-769008  GSM +44-7768-598570 URL:

 

Registered in England and Wales no. 3055963 Reg.Office Aspen House, Burton Row, Brent Knoll, Somerset TA9 4BW.  UK

 

 

 

From: Jones, Chris C (UK Warton) <chris.c.jones@xxxxxxxxxxxxxx>
Sent: 08 December 2020 23:18
To: Larry Dickson <tjoccam@xxxxxxxxxxx>; Ruth Ivimey-Cook <ruth@xxxxxxxxxx>
Cc: David May <David.May@xxxxxxxxxxxxx>; Paul Walker <paul@xxxxxxxxxxxx>; occam-com@xxxxxxxxxx; Tony Gore <tony@xxxxxxxxxxxx>; Uwe Mielke <uwe.mielke@xxxxxxxxxxx>; Denis A Nicole <dan@xxxxxxxxxxxxxxx>; Øyvind Teig <oyvind.teig@xxxxxxxxxxx>; Michael Bruestle <michael_bruestle@xxxxxxxxx>; Transputer TRAM <claus.meder@xxxxxxxxxxxxxx>
Subject: RE: Transistor count

 

 


 

Larry,

 

Not all of us have given up exactly, but there are so many barriers to achieving massive parallelism.  I have to pack lots of processing into each core to hide the message passing, and then I need loads of cache RAM to make that amount of work efficient, which means I cannot efficiently make use of all the cores available on each processor – we run on just half of them most of the time and simply waste the rest.  Thus, for our types of problem, I can find the sweet spot in terms of the number of cores and processors for a particular case, which in most cases is between 500 and 1800 cores.  Below the sweet spot the code runs linearly more slowly; at the sweet spot a single run might take 10 days or more; above the sweet spot the code runs progressively more slowly, tipping over remarkably quickly into not just diminishing returns but catastrophic slowdown, so that at about 2000 cores we might as well run on a big workstation.

 

This means that I can increase throughput by running more cases simultaneously, provided I have many cases to run.  What I cannot do is run the problems faster, and that is what I need.  I was offered a computer as big as a block of flats the other day (in jest, I should add), but I had to say that it would still not make the code run faster.  That is the tragedy of present-day parallel computing and HPC.  Is this why the term supercomputing seems to have been dropped?

 

By the way, I am still running an “embarrassingly parallel” code. 

 

Regards,

Chris

 

 

 

Prof. Christopher C R Jones BSc. PhD C.Eng. FIET

BAE Systems Engineering Fellow

EMP Fellow of the Summa Foundation

Principal Technologist – Electromagnetics               

 

Military Air & Information                Direct:  +44 (0) 3300 477425

Electromagnetic Engineering, W423A        Mobile:  +44 (0)7855 393833

Engineering Integrated Solutions          Fax:     +44 (0)1772 855262

Warton Aerodrome                          E-mail:  chris.c.jones@xxxxxxxxxxxxxx

Preston                                   Web:     www.baesystems.com

PR4 1AX

 

BAE Systems (Operations) Limited
Registered Office: Warwick House, PO Box 87, Farnborough Aerospace Centre, Farnborough, Hants, GU14 6YU, UK
Registered in England & Wales No: 1996687

Exported from the United Kingdom under the terms of the UK Export Control Act 2002 (DEAL No 8106)

 

From: occam-com-request@xxxxxxxxxx [mailto:occam-com-request@xxxxxxxxxx] On Behalf Of Larry Dickson
Sent: 08 December 2020 01:37
To: Ruth Ivimey-Cook
Cc: David May; Paul Walker; Jones, Chris C (UK Warton); occam-com@xxxxxxxxxx; Tony Gore; Uwe Mielke; Denis A Nicole; Øyvind Teig; Michael Bruestle; Transputer TRAM
Subject: Re: Transistor count

 



 


Ruth and David,

 

I am trying to follow your discussion here, but let me cut to the chase with the implication that communication setup time has stalled at 1 us and not improved since the Transputer. Measured in cycles, that seems to mean communication setup time has INCREASED from 20 cycles to about 2000 cycles. This problem is not new. I remember one of the occam porting projects, to (I believe) the PowerPC (I can't find the reference), actually tackled external communication, but the overhead was about 700 cycles if I remember right.

 

Fixing the problems (as David says) that slow you down by a factor of 100 certainly seems to make sense. I wonder what the physical limits are if you try to make a 2GHz Transputer? Speed of light for 20 cycles would still give you 3 meters. I know this is crude extrapolation, but . . . I wonder if the real problem is that people have given up on massively parallel programming (except for GPUs).

 

Larry

 

On Dec 6, 2020, at 1:42 PM, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:

 

Hi David,

Thanks for your thoughts.

It is in part the off-chip/on-chip disparity that makes me think that a co-processor style arrangement for comms is what is needed. As soon as the CPU has to go off-chip all the latencies grow massively, so keep it local! Again, the need to flip to kernel mode and initialise DMA and interrupts makes me feel that a coprocessor that "understands" what is going on, and can therefore avoid needing to be told everything explicitly each time, is the right path. Ideally, the 'start communication' instruction would take as long to execute as a multiply, essentially transferring values from cpu registers to a queue on the hardware, with another instruction that executes a software trap if the operation hasn't completed (so the calling thread can wait on completion).
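
In rough pseudo-C, the pattern would look something like the sketch below; the intrinsic names are invented, since nothing like them exists in any current ISA:

/* Sketch of the two-instruction pattern: one cheap instruction queues the
   transfer on the comms coprocessor, a second waits for it, trapping to the
   scheduler if it has not completed. Both intrinsics are hypothetical. */
#include <stddef.h>
#include <stdint.h>

extern uint32_t __comm_send_start(int channel, const void *buf, size_t len);
extern void     __comm_wait(uint32_t ticket);   /* traps/deschedules if pending */

void send_vector(int chan, const double *v, size_t n)
{
    uint32_t t = __comm_send_start(chan, v, n * sizeof *v); /* ~cost of a multiply */
    /* ... overlap useful work here while the coprocessor moves the data ... */
    __comm_wait(t);                              /* block only if still in flight */
}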

The x86_64 CPUs from Intel and AMD both have very high-speed inter-core interconnects, though I know little about them. They are apparently good for bus snooping and therefore for transferring short (cache-line = 64..256 byte) blocks of data very quickly. It would be really interesting to know whether this hardware could be targeted by specialised processor instructions to make it useful for explicit communication, rather than implicit.

You're right about memory, of course, though of late the DDR line of memory has been improving faster than it did for quite a while (for about 10 years performance improved by about 20%, but in the last 3 or 4 it has almost doubled). However, that doubling is set against a 100- to 1000-fold performance differential to the CPU. Recent generations of memory interface have been trying to address this with wider memory buses -- some Intel chips now have 4 parallel channels of DDR4, a 256-bit-wide data path -- but although this improves overall bandwidth it does little for the latency of any individual transfer. The famous "Spectre" (etc.) bugs in chip design come about mostly from the efforts to hide this latency from the core.

The only way I've seen of addressing the issue is to bring memory on-chip... which is what High Bandwidth Memory (HBM) does. It puts stacked DRAM dies in the same package as the processor (in some designs directly on top of it), with wide parallel interconnections across the whole plane. Thus both a wide memory interface and a much higher-speed one (short, small wires) can be achieved. The downside is that the size of memory is limited to what fits in the package, and it also limits processor power dissipation.

No magic bullets then. But it will take a lot of pressure to push the software industry to change their methods to enable the adoption of more parallel-friendly hardware, and I don't see that happening at present. The current pressures on programming are all aimed at reducing the cost to market, even at the expense of later maintenance. Achieved software quality is (IMO) low, and getting worse.

Best wishes,

Ruth

 

On 06/12/2020 18:17, David May wrote:

Hi Ruth,

 

There are quite a few studies of HPC message latency - for example https://www.hpcwire.com/2019/12/02/it-is-about-latency/

 

Ideally, a processor should be able to communicate data as fast as it can process data. And ideally, this should apply to small items of data, not just large ones. We’d like to be able to off-load a procedure call to another processor, for example. The transputer came close to this in 1985. Since then, processor performance has increased by a factor of more than 1,000 (a combination of clock speed and instruction-level parallelism), but communication performance hasn’t increased much at all (except for very large data items). The issue seems to be primarily the set-up and completion times - switch to kernel, initialise DMA controller, de-schedule process … interrupt, re-schedule process. However, there is also an issue with chip-to-chip communication. If you build chips with lots of processors - potentially hundreds with 5nm process technology - the imbalance between on-chip processing performance and inter-chip communication performance is vast (both latency and throughput).  
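
To put rough numbers on that gap (illustrative figures, not measurements): with about a microsecond of set-up and completion overhead, even a fast link needs a message of several kilobytes before the transfer time matches the overhead, so short messages are dominated entirely by set-up:

/* Sketch: break-even message size where transfer time equals the fixed
   set-up/completion overhead. Illustrative figures only. */
#include <stdio.h>

int main(void)
{
    const double setup_s  = 1e-6;    /* ~1 microsecond set-up + completion */
    const double link_Bps = 10e9;    /* assume a 10 GB/s link              */
    const double clock_Hz = 3e9;     /* assume a 3 GHz core                */

    printf("overhead       = %.0f cycles\n", setup_s * clock_Hz);
    printf("break-even msg = %.0f bytes\n",  setup_s * link_Bps);
    return 0;
}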

 

I made some proposals about these issues in http://www.icpp-conf.org/2017/files/keynote-david-may.pdf (especially from slide 27).

 

Incidentally, the problem with most current architectures isn’t just communications - the memory systems don’t work well either (see from slide 17 of above)!

 

It is entirely possible to fix the problems with communication set-up and completion - one option is the microprogrammed transputer-style implementation; another is the multithreaded XCore style with single-cycle i/o instructions (which means that threads can act as programmable DMA controllers); yet another is to provide execution modes (with dedicated registers) for complex instructions and interrupts - this enables the process management to be done by software and was first used on the Atlas computers of the 1960s (Normal control, Extracode control, Interrupt control)!
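
As a flavour of the “thread as programmable DMA controller” idea, a dedicated hardware thread with cheap access to an input port can simply sit in a copy loop – roughly as below, where the port addresses and status bit are invented (on an XCore the loop body would be single-cycle input instructions and the thread would block rather than spin):

/* Sketch: a dedicated thread acting as a programmable DMA controller,
   draining an input port into a buffer. Addresses and bits are hypothetical. */
#include <stdint.h>

#define LINK_DATA   (*(volatile uint32_t *)0x40000000u)  /* hypothetical data port   */
#define LINK_STATUS (*(volatile uint32_t *)0x40000004u)  /* hypothetical status port */
#define DATA_READY  0x1u

void drain_link(uint32_t *dst, int nwords)
{
    for (int i = 0; i < nwords; i++) {
        while (!(LINK_STATUS & DATA_READY))
            ;                    /* a hardware thread would block here, not spin */
        dst[i] = LINK_DATA;      /* one word per iteration, no interrupts needed */
    }
}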

 

All the best

 

David

 

 

 

 

On 5 Dec 2020, at 03:04, Ruth Ivimey-Cook <ruth@xxxxxxxxxx> wrote:

 

David,

Did you see my comments earlier, where I was talking about, among other things, my ideas for a modern transputer? I mentioned various points about latency, and would be interested in your thoughts. I have quoted the relevant part below in case not.

I'm raising it again because you mentioned the 1us latency (I presume this is the time to indicate a communication is needed, not to complete it, so no overlapping of this with other work). It doesn't surprise me: most I/O subsystems seem to be adequately optimised for bulk throughput but really awful on startup latency.

On 04/12/2020 19:24, David May wrote:

This issue - and the related issue of processor-interconnect latency - is the main reason that we’re not doing much parallel computing. The interprocessor communication latency is still around 1microsecond (same as transputer) for a short message. As a result, many ‘supercomputers’ are just used as clusters running scripts that launch lots of small jobs.

Best wishes,

Ruth

 

On 23 Nov 2020 23:35, Ruth Ivimey-Cook wrote:

 

One thing I have been contemplating for some time is what it would take to make a modern transputer. I feel the critical element of a design is provision of channel/link hardware that has minimal setup time and DMA driven access. I feel reducing latency, especially setup time, requires a coprocessor-like interface to the cpu, so that a single processor instruction can initiate comms. If hardware access were required over PCIe, for example, it would take hundreds of processor cycles. Pushing that work into a coprocessor enables the main cpu to get on with other things, and maximises the chance that the comms will happen as fast as possible.

The other side of the coin would be to make the link engine(s) essentially wormhole routers, as in the C104 router chips, complete with packet addressing. The link coprocessor would then become some number of interfaces directly to the CPU plus some number of interfaces to the external world, with a crossbar in the middle. This would massively increase the communications effectiveness of the design, and while taking up much more silicon area, I believe it would be an overall benefit for any non-trivial system. One net result is the elimination of the massive amount of 'receive and pass on' routing code that used to be needed with a directly connected link design.
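
In outline, the header flit of each packet only needs enough information to steer it through the crossbar. The sketch below uses interval-style routing loosely in the spirit of the C104, but the field widths and table format are illustrative, not the real packet format:

/* Sketch of interval-style wormhole routing, loosely in the spirit of the
   C104 (not its actual format). The header flit carries only the destination;
   each switch picks an output link by interval lookup. */
#include <stdint.h>

typedef struct {
    uint16_t dest;       /* destination node / virtual link number */
} header_flit;

typedef struct {
    uint16_t limit[8];   /* dest < limit[i]  =>  route out of link i */
} interval_table;

int route(const interval_table *t, header_flit h)
{
    for (int link = 0; link < 8; link++)
        if (h.dest < t->limit[link])
            return link;         /* body flits then follow the same path */
    return -1;                   /* destination out of range */
}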

The final element of the mix would be to engineer the system such that software virtualisation of links was standard -- as was true on the transputer -- so code could think just about performing communication, not about which physical layer was involved, and also a way for the link engine to raise an exception (e.g. sw interrupt) to the processor if it cannot complete the communication 'instantly', thus potentially requiring a thread to suspend or be released.

I don't know, but from what I have seen so far I don't think it is worth the complexity and constraint of putting support for interleaved threads into the processor hardware, as the Ts did, but I do feel it is valuable for the hardware to provide appropriate hooks for a lightweight threaded kernel to do the job efficiently.

 

 

 
