> They named it AI because buzzword and marketing bullshit...
They named it AI because it massively boosts embarrassingly parallel workloads. You can think of Processing In Memory as making the mapPartitions() operation essentially free in Spark's MapReduce ML workloads.
Some algorithms like DNA sequencing have a tradeoff between map and reduce [1]: you spend more time generating higher quality matches between the short sequences (map), before sending them for global matching (reduce). And PIM lets you exploit that.
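The map/reduce tradeoff described above can be sketched in plain Python (a toy illustration, not real Spark; the read records and scores are made up):

```python
# Toy sketch: spend more work in the per-partition "map" phase (local
# filtering to higher-quality matches) so the global "reduce" phase has
# less data to shuffle. This is what PIM makes cheap: the map side runs
# right next to the data.
def map_partition(reads, min_score):
    # Keep only high-quality local matches; everything dropped here never
    # has to travel to the global matching step.
    return [r for r in reads if r["score"] >= min_score]

def reduce_global(partitions):
    # Global matching stand-in: merge the surviving candidates.
    merged = []
    for part in partitions:
        merged.extend(part)
    return sorted(merged, key=lambda r: r["score"], reverse=True)

partitions = [
    [{"seq": "ACGT", "score": 0.9}, {"seq": "TTAA", "score": 0.2}],
    [{"seq": "GGCC", "score": 0.7}, {"seq": "CATG", "score": 0.4}],
]
# Raising min_score = more map-side work, less reduce-side traffic.
survivors = [map_partition(p, min_score=0.5) for p in partitions]
result = reduce_global(survivors)
```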
For an order of magnitude: the average Intel CPU has about 60GB/s of RAM bandwidth per socket. 256 GB of UPMEM's RAM gives you 2.5TB/s of local bandwidth to a computation unit (to 2560 'dumb' cores @400MHz) [2].
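A quick back-of-envelope check of those figures (using only the numbers quoted above):

```python
# Back-of-envelope on the quoted bandwidth figures.
socket_bw_gbps = 60          # typical per-socket DRAM bandwidth, GB/s
pim_total_bw_gbps = 2500     # 2.5 TB/s aggregate local bandwidth
pim_cores = 2560             # 'dumb' cores @ 400 MHz

# Each PIM core gets ~1 GB/s of bandwidth all to itself...
per_core_bw = pim_total_bw_gbps / pim_cores
# ...and in aggregate that's ~42x one socket's worth of shared bandwidth.
aggregate_ratio = pim_total_bw_gbps / socket_bw_gbps
```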
I feel like push-down operations might be a better analogy from the mapreduce world?
It strikes me these processors would be most helpful in pre-multiplies, filter operations, and perhaps for scatters. All that stuff is not just relevant to tensorflow / pytorch stuff but also databases. While I’m sure the “AI” labeling is pure marketing, I’d imagine Samsung would love to target workloads beyond deep learning training and inference.
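The push-down idea in database terms can be shown with a toy sketch (plain Python; the row schema and the >7 predicate are made up for illustration):

```python
# Toy sketch of predicate push-down: run the filter where the data lives
# (memory-side compute) instead of shipping every row to the host CPU.
rows = [{"id": i, "value": i % 10} for i in range(1000)]

def scan_then_filter(rows):
    # Naive plan: "ship" all rows across the bus, then filter on the host.
    shipped = list(rows)                       # 1000 rows cross the bus
    return [r for r in shipped if r["value"] > 7], len(shipped)

def pushed_down_filter(rows):
    # Push-down plan: filter at the memory side, ship only the survivors.
    survivors = [r for r in rows if r["value"] > 7]
    return survivors, len(survivors)           # 200 rows cross the bus

naive, naive_shipped = scan_then_filter(rows)
pushed, pushed_shipped = pushed_down_filter(rows)
```

Same result either way; the difference is how many bytes cross the memory bus, which is exactly the resource PIM economizes.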
> All that stuff is not just relevant to tensorflow / pytorch stuff but also databases.
Yes! and that's the beauty of it. It is not an accelerator, these are fully generic cores.
Not equivalent to 'smart' Intel cores with all the branch prediction, prefetching and caching magic; but with massive computation capabilities nonetheless.
GPUs do have massive amounts of memory (both in RAM and registers), but you have to have preloaded your stuff into it beforehand. And what you can actually do efficiently are SIMD operations.
I'd liken PIM to a better GPU-CPU blend: you get to keep your CPU doing its things with massive parallel operations concurrently. Also, these seem to be mostly independent cores, so you would not be limited to SIMD.
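The SIMD-vs-independent-cores distinction can be made concrete with a toy contrast (plain Python standing in for hardware lanes/cores):

```python
# Toy contrast: SIMD applies ONE operation to every lane in lockstep,
# while independent cores can each take a different code path cheaply.
lanes = [1, 2, 3, 4]

# SIMD-style: the same op everywhere. On a GPU, divergent branches within
# a warp get serialized, so branchy code wastes the hardware.
simd_result = [x * 2 for x in lanes]

# Independent-core style: each element runs its own branch at full speed.
def per_core_task(x):
    if x % 2 == 0:
        return x * 2       # one core doubles its value...
    return x + 100         # ...another does something entirely different

mimd_result = [per_core_task(x) for x in lanes]
```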
Let's bet: in 10 years, AWS will have a new offering: the 'nano lambda'. You get a PIM core share, with 10 MB local 'persistent' RAM (keeping your data + a continuation of your code when it is not running), running your tiny Loom thread [1], at the edge, billed at 1us granularity, only when it is running, and for 0.0000000000000001 USD per us.
Huh I don't know about 'nano lambda', but I could see it being an add-on to S3. Customers already have petabytes upon petabytes there and if you could push down a filter or pre-multiply op and save a million bucks then sure people would do that. If EMR does it automagically then even better.
When could it happen? My guess is S3 favors very old compute hardware but I could be wrong. If it does, though, that could also mean it's just very cheap to replace ;)
Well, PIM, or Processing In Memory, or Computational Memory [1], isn't new. The question is what exactly they put in that memory, and what the target performance speed-ups are in that specific domain. This PR provides embarrassingly little detail.
Yep, this is indeed very little detail. To me this is the PR you send when you just discovered you're late, but still want to retain clients' attention.
They felt the wind turning, saw the market ask for it, saw companies like UPMEM having a lead.
If you want more details, UPMEM has more. If I recall correctly they etched their own core right into the same silicon as the DRAM.
If it interests you: as stated at the bottom of the press release, they appear to be presenting a conference-accepted paper on this.
Looking at the ISSCC program, I presume this is what they're talking about: "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications" [1]. I suspect this is where there will be a deep dive into the gory technical details of the architecture and performance. It's tomorrow so it won't be a long wait.
A GPU mostly does SIMD. This would enable independent workloads. If your simple single-threaded webserver only requires 20MB of RAM, you could just run it in its tiny chunk of RAM; and it would sometimes ask the CPU to coordinate with the NIC.
Also, a GPU requires transport of data to it. PIM lets the CPU do concurrent work at very low latencies.
AI is one of the very few domains where processing-in-memory makes sense. This particular system is a little slapdash, but I feel strongly that in the very near future, this is the only architecture that will be used for AI.
(Meaning, relatively small chiplet AI processors with ram stacked on top of them.)
The reason is that as the precision used for the coefficients has gone down, the relative energy cost of computing on them has turned into a rounding error compared to the cost of moving data to the ALUs. And in AI there is very little penalty for distributing the processing power across many small chips that are relatively far away from each other.
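To put rough numbers on that "rounding error" claim, here is a sketch using commonly cited order-of-magnitude energy figures (from Horowitz's ISSCC 2014 keynote, ~45nm process; the exact values vary by node, so treat these as illustrative):

```python
# Approximate energy costs per operation (picojoules), ~45nm, per
# Horowitz ISSCC'14. Illustrative orders of magnitude only.
add_8bit_pj = 0.03        # 8-bit integer add
dram_read_32bit_pj = 640  # 32-bit read from off-chip DRAM

# Fetching one 32-bit word from DRAM costs ~20,000x the low-precision
# add it feeds -- hence compute is the rounding error, movement dominates.
ratio = dram_read_32bit_pj / add_8bit_pj
```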
AI is likely to be the biggest beneficiary of this architecture. The on-memory processing chips are likely to be simpler than CPU cores (i.e. more akin to GPU cores) and allow parallelism - both of which point to numerical processing and AI.
I like the idea of swapping in DIMMs to get capacity and processing improvements, and also the thought that mass-produced memory with this tech could significantly reduce the cost of AI hardware through commoditization.
But seriously though, it seems to answer an ancient techie question of mine: since we're strobing memories millions/billions of times per second, couldn't they be doing more than storage with all those clocks?
It's always a trade-off: you have to balance the raw area of packed memory-cell rows against all the "support" fluff: column precharge circuitry, readout buffers, address decoders, etc. I am also unsure whether non-uniform random access times would break some of the abstractions about RAM. In NAND flash that sort of page/bank parallelism is integrated, since operations are slow anyway.
I speculate that the eventual ideal goal to strive towards will be RAM strip-to-strip processing taking all of one module's data, feeding one layer and dumping results into the next module. The individual layers accessible for both read and write as ordinary RAM.
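That strip-to-strip idea can be sketched as a toy pipeline (plain Python; the `MemoryModule` class and its weights are entirely hypothetical, just to show each module transforming data in place and feeding the next):

```python
# Toy sketch of strip-to-strip processing: each "module" holds one layer's
# parameters, computes on the data where it lives, and hands the result
# to the next module. Each module stays readable/writable as ordinary RAM.
class MemoryModule:
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias  # parameters resident here

    def apply(self, xs):
        # ReLU(w*x + b) per element, computed next to the stored weights.
        return [max(0, self.weight * x + self.bias) for x in xs]

pipeline = [MemoryModule(2, 0), MemoryModule(1, -3), MemoryModule(0.5, 0)]

activations = [1, 2, 3]
for module in pipeline:
    activations = module.apply(activations)   # one module feeds the next
```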
This is a great step and all, but shouldn't we be a little more adventurous? A unified understanding of computation and thermodynamics has the potential to enable systems that are vastly more capable. We are piddling around in the shallow end making incremental improvements. A few billion thrown in novel directions could reap extraordinary rewards.
I understand where you're coming from, but I would rather see modular and stackable pieces affordable by ordinary users and hackable for power users.
Hackable mainly because of the nature of neural networks - their architecture matters.
> Vastly more capable systems
I interpret this as specialized silicon that's mass produced? I urge you to remember how much academics and hobbyists gain from having FPGAs around, despite their relative bulkiness and mediocre parameters.
Is this really the same thing? I don't have an account, so I couldn't read the whole thing.
> Circuit and design techniques are presented for enhancing the performance and reliability of a 3-D-stacked high bandwidth memory-2 extension (HBM2E). A data-bus window extension technique is implemented to cope with reduced clock cycle time ranging from data-path architecture, through-silicon via (TSV) placement, and TSV-PHY alignment. A power TSV placement in the middle of array and at the chip edge along with a dedicated top metal for power mesh improves power IR drop by 62%. An on-die ECC (OD-ECC) scheme featuring a self-scrubbing function is designed to be orthogonal to system ECC. An uncorrectable bit error rate (UBER) is improved by 10^5 times with the proposed OD-ECC and scrubbing scheme. A memory built-in self-test (MBIST) block supports low-frequency cell and core test in a parallel manner and all channel at-speed operation with adjustable ac parameters. The proposed parallel-bit MBIST reduces test time by 66%. A 16-GB HBM2E fabricated in the second generation of 10-nm class DRAM process achieves a bandwidth up to 640 GB/s (5 Gb/s/pin) and provides a stable bit-cell operation at a high temperature.
None of the items in the abstract has anything to do with AI.
Ah right it's behind a paywall, sorry. The introduction opens with:
> Rapidly evolving artificial intelligence (AI) technology, such as deep learning, has been successfully deployed in various applications, such as image recognition, health care, and autonomous driving. Such rapid evolution and successful deployment of AI technology have been possible owing to the emergence of accelerators, such as GPUs and TPUs, that have a higher data throughput.
Edit: You might be right, I peeked into the ISSCC programme looking for something relevant from Samsung, and they are presenting a paper titled "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications". However, there is a lot of overlap in paper authors, so I'd imagine it's the same team.
Another recent primer on in-memory / near-memory computing is [1]. UPMEM [2] is also selling memory with on-board compute. A space that is slowly heating up!
[1] O. Mutlu, S. Ghose, J. Gomez-Luna, R. Ausavarungnirun, "A Modern Primer on Processing in Memory". https://arxiv.org/abs/2012.03112
Tip: if you contact the author(s) of a paper that is of interest to you and ask for a version of it, there's a good chance that they'll gladly accommodate. I think generally authors don't even have any financial benefit if you pay for the paper (it all goes to the publisher).
However, I read the title as: "We couldn't think of anything good to say about the product, so we added a fashionable buzzword."
The comments here mostly say this has nothing to do with AI.