> They named it AI because buzzword and marketing bullshit...
They named it AI because it massively boosts embarrassingly parallel workloads. You can think of Processing In Memory as making the mapPartitions() operation essentially free in Spark's MapReduce ML workloads.
Some algorithms like DNA sequencing have a tradeoff between map and reduce [1]: you spend more time generating higher quality matches between the short sequences (map), before sending them for global matching (reduce). And PIM lets you exploit that.
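The map/reduce tradeoff described above can be sketched in plain Python (a toy illustration, not real Spark; the read records and scores are made up):

```python
# Toy sketch: spend more work in the per-partition "map" phase (local
# filtering to higher-quality matches) so the global "reduce" phase has
# less data to shuffle. This is what PIM makes cheap: the map side runs
# right next to the data.
def map_partition(reads, min_score):
    # Keep only high-quality local matches; everything dropped here never
    # has to travel to the global matching step.
    return [r for r in reads if r["score"] >= min_score]

def reduce_global(partitions):
    # Global matching stand-in: merge the surviving candidates.
    merged = []
    for part in partitions:
        merged.extend(part)
    return sorted(merged, key=lambda r: r["score"], reverse=True)

partitions = [
    [{"seq": "ACGT", "score": 0.9}, {"seq": "TTAA", "score": 0.2}],
    [{"seq": "GGCC", "score": 0.7}, {"seq": "CATG", "score": 0.4}],
]
# Raising min_score = more map-side work, less reduce-side traffic.
survivors = [map_partition(p, min_score=0.5) for p in partitions]
result = reduce_global(survivors)
```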
For an order of magnitude: the average Intel CPU has about 60GB/s of RAM bandwidth per socket. 256 GB of UPMEM's RAM gives you 2.5TB/s of local bandwidth to a computation unit (to 2560 'dumb' cores @400MHz) [2].
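A quick back-of-envelope check of those figures (using only the numbers quoted above):

```python
# Back-of-envelope on the quoted bandwidth figures.
socket_bw_gbps = 60          # typical per-socket DRAM bandwidth, GB/s
pim_total_bw_gbps = 2500     # 2.5 TB/s aggregate local bandwidth
pim_cores = 2560             # 'dumb' cores @ 400 MHz

# Each PIM core gets ~1 GB/s of bandwidth all to itself...
per_core_bw = pim_total_bw_gbps / pim_cores
# ...and in aggregate that's ~42x one socket's worth of shared bandwidth.
aggregate_ratio = pim_total_bw_gbps / socket_bw_gbps
```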
I feel like push-down operations might be a better analogy from the mapreduce world?
It strikes me these processors would be most helpful in pre-multiplies, filter operations, and perhaps for scatters. All that stuff is not just relevant to tensorflow / pytorch stuff but also databases. While I’m sure the “AI” labeling is pure marketing, I’d imagine Samsung would love to target workloads beyond deep learning training and inference.
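The push-down idea in database terms can be shown with a toy sketch (plain Python; the row schema and the >7 predicate are made up for illustration):

```python
# Toy sketch of predicate push-down: run the filter where the data lives
# (memory-side compute) instead of shipping every row to the host CPU.
rows = [{"id": i, "value": i % 10} for i in range(1000)]

def scan_then_filter(rows):
    # Naive plan: "ship" all rows across the bus, then filter on the host.
    shipped = list(rows)                       # 1000 rows cross the bus
    return [r for r in shipped if r["value"] > 7], len(shipped)

def pushed_down_filter(rows):
    # Push-down plan: filter at the memory side, ship only the survivors.
    survivors = [r for r in rows if r["value"] > 7]
    return survivors, len(survivors)           # 200 rows cross the bus

naive, naive_shipped = scan_then_filter(rows)
pushed, pushed_shipped = pushed_down_filter(rows)
```

Same result either way; the difference is how many bytes cross the memory bus, which is exactly the resource PIM economizes.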
> All that stuff is not just relevant to tensorflow / pytorch stuff but also databases.
Yes! and that's the beauty of it. It is not an accelerator, these are fully generic cores.
Not equivalent to 'smart' Intel cores with all the branch prediction, prefetching and caching magic; but with massive computation capabilities nonetheless.
GPUs do have massive amounts of memory (both in RAM and registers), but you have to have preloaded your stuff into it beforehand. And what you can actually do efficiently are SIMD operations.
I'd liken PIM to a better GPU-CPU blend: you get to keep your CPU doing its things with massive parallel operations concurrently. Also, these seem to be mostly independent cores, so you would not be limited to SIMD.
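The SIMD-vs-independent-cores distinction can be made concrete with a toy contrast (plain Python standing in for hardware lanes/cores):

```python
# Toy contrast: SIMD applies ONE operation to every lane in lockstep,
# while independent cores can each take a different code path cheaply.
lanes = [1, 2, 3, 4]

# SIMD-style: the same op everywhere. On a GPU, divergent branches within
# a warp get serialized, so branchy code wastes the hardware.
simd_result = [x * 2 for x in lanes]

# Independent-core style: each element runs its own branch at full speed.
def per_core_task(x):
    if x % 2 == 0:
        return x * 2       # one core doubles its value...
    return x + 100         # ...another does something entirely different

mimd_result = [per_core_task(x) for x in lanes]
```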
Let's bet: in 10 years, AWS will have a new offering: the 'nano lambda'. You get a PIM core share, with 10 MB local 'persistent' RAM (keeping your data + a continuation of your code when it is not running), running your tiny Loom thread [1], at the edge, billed at 1us granularity, only when it is running, and for 0.0000000000000001 USD per us.
Huh I don't know about 'nano lambda', but I could see it being an add-on to S3. Customers already have petabytes upon petabytes there and if you could push down a filter or pre-multiply op and save a million bucks then sure people would do that. If EMR does it automagically then even better.
When could it happen? My guess is S3 favors very old compute hardware but I could be wrong. If it does, though, that could also mean it's just very cheap to replace ;)
Well, PIM, or Processing In Memory, or Computational Memory [1], isn't new. The question is what exactly they put in that memory, and what the target performance speed-ups are in that specific domain. This PR provides embarrassingly little detail.
Yep, this is indeed very little detail. To me this is the PR you send when you just discovered you're late, but still want to retain clients' attention.
They felt the wind turning, saw the market ask for it, saw companies like UPMEM having a lead.
If you want more details, UPMEM has more. If I recall correctly they etched their own core right into the same silicon as the DRAM.
If it interests you: as stated at the bottom of the press release, they appear to be presenting a conference-accepted paper on this.
Looking at the ISSCC program, I presume this is what they're talking about: "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications" [1]. I suspect this is where there will be a deep dive into the gory technical details of the architecture and performance. It's tomorrow so it won't be a long wait.
A GPU mostly does SIMD. This would enable independent workloads. If your simple single-threaded webserver only requires 20MB of RAM, you could just run it in its tiny chunk of RAM; and it would sometimes ask the CPU to coordinate with the NIC.
Also, a GPU requires transport of data to it. PIM lets the CPU do concurrent work at very low latencies.
AI is one of the very few domains where processing-in-memory makes sense. This particular system is a little slapdash, but I feel strongly that in the very near future, this is the only architecture that will be used for AI.
(Meaning, relatively small chiplet AI processors with ram stacked on top of them.)
The reason is that as the precision used for the coefficients has gone down, the relative energy cost of computing on them has turned into a rounding error compared to the cost of moving data to the ALUs. And in AI there is very little penalty for distributing the processing power across many small chips that are relatively far away from each other.
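To put rough numbers on that "rounding error" claim, here is a sketch using commonly cited order-of-magnitude energy figures (from Horowitz's ISSCC 2014 keynote, ~45nm process; the exact values vary by node, so treat these as illustrative):

```python
# Approximate energy costs per operation (picojoules), ~45nm, per
# Horowitz ISSCC'14. Illustrative orders of magnitude only.
add_8bit_pj = 0.03        # 8-bit integer add
dram_read_32bit_pj = 640  # 32-bit read from off-chip DRAM

# Fetching one 32-bit word from DRAM costs ~20,000x the low-precision
# add it feeds -- hence compute is the rounding error, movement dominates.
ratio = dram_read_32bit_pj / add_8bit_pj
```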
AI is likely to be the biggest beneficiary of this architecture. The on-memory processing chips are likely to be simpler than CPU cores (i.e. more akin to GPU cores) and allow parallelism - both of which point to numerical processing and AI.
I like the idea of swapping in DIMMs to get capacity and processing improvements, and also the thought that mass-produced memory with this tech could significantly reduce the cost of AI hardware through commoditization.
But seriously though, it seems to answer an ancient techie question of mine: since we're strobing memories millions/billions of times per second, couldn't they be doing more than storage with all those clocks?
It's always a trade-off: you have to balance the raw area of packed memory-cell rows against all the "support" fluff: column precharge circuitry, readout buffers, address decoders, etc. I am also unsure whether non-uniform random access times would break some of the abstractions about RAM. In NAND flash that sort of page/bank parallelism is integrated, since operations are slow anyway.
I speculate that the eventual ideal goal to strive towards will be RAM strip-to-strip processing taking all of one module's data, feeding one layer and dumping results into the next module. The individual layers accessible for both read and write as ordinary RAM.
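That strip-to-strip idea can be sketched as a toy pipeline (plain Python; the `MemoryModule` class and its weights are entirely hypothetical, just to show each module transforming data in place and feeding the next):

```python
# Toy sketch of strip-to-strip processing: each "module" holds one layer's
# parameters, computes on the data where it lives, and hands the result
# to the next module. Each module stays readable/writable as ordinary RAM.
class MemoryModule:
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias  # parameters resident here

    def apply(self, xs):
        # ReLU(w*x + b) per element, computed next to the stored weights.
        return [max(0, self.weight * x + self.bias) for x in xs]

pipeline = [MemoryModule(2, 0), MemoryModule(1, -3), MemoryModule(0.5, 0)]

activations = [1, 2, 3]
for module in pipeline:
    activations = module.apply(activations)   # one module feeds the next
```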
This is a great step and all, but shouldn't we be a little more adventurous? A unified understanding of computation and thermodynamics has the potential to enable systems that are vastly more capable. We are piddling around in the shallow end making incremental improvements. A few billion thrown in novel directions could reap extraordinary rewards.
I understand where you're coming from, but I would rather see modular and stackable pieces affordable by ordinary users and hackable for power users.
Hackable mainly because of the nature of neural networks - their architecture matters.
> Vastly more capable systems
I interpret this as specialized silicon that's mass produced? I urge you to remember how much academics and hobbyists gain from having FPGAs around, despite their relative bulkiness and mediocre parameters.
Is this really the same thing? I don't have an account, so I couldn't read the whole thing.
> Circuit and design techniques are presented for enhancing the performance and reliability of a 3-D-stacked high bandwidth memory-2 extension (HBM2E). A data-bus window extension technique is implemented to cope with reduced clock cycle time ranging from data-path architecture, through-silicon via (TSV) placement, and TSV-PHY alignment. A power TSV placement in the middle of array and at the chip edge along with a dedicated top metal for power mesh improves power IR drop by 62%. An on-die ECC (OD-ECC) scheme featuring a self-scrubbing function is designed to be orthogonal to system ECC. An uncorrectable bit error rate (UBER) is improved by 10^5 times with the proposed OD-ECC and scrubbing scheme. A memory built-in self-test (MBIST) block supports low-frequency cell and core test in a parallel manner and all channel at-speed operation with adjustable ac parameters. The proposed parallel-bit MBIST reduces test time by 66%. A 16-GB HBM2E fabricated in the second generation of 10-nm class DRAM process achieves a bandwidth up to 640 GB/s (5 Gb/s/pin) and provides a stable bit-cell operation at a high temperature.
None of the items in the abstract has anything to do with AI.
Ah right it's behind a paywall, sorry. The introduction opens with:
> Rapidly evolving artificial intelligence (AI) technology, such as deep learning, has been successfully deployed in various applications, such as image recognition, health care, and autonomous driving. Such rapid evolution and successful deployment of AI technology have been possible owing to the emergence of accelerators, such as GPUs and TPUs, that have a higher data throughput.
Edit: You might be right, I peeked into the ISSCC programme looking for something relevant from Samsung, and they are presenting a paper titled "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications". However, there is a lot of overlap in paper authors, so I'd imagine it's the same team.
Another recent primer on in-memory / near-memory computing is [1]. UPMEM [2] is also selling memory with on-board compute. A space that is slowly heating up!
[1] O. Mutlu, S. Ghose, J. Gomez-Luna, R. Ausavarungnirun, "A Modern Primer on Processing in Memory". https://arxiv.org/abs/2012.03112
Tip: if you contact the author(s) of a paper that is of interest to you and ask for a version of it, there's a good chance that they'll gladly accommodate. I think generally authors don't even have any financial benefit if you pay for the paper (it all goes to the publisher).
However, I read the title as: "We couldn't think of anything good to say about the product, so we added a fashionable buzzword."
The comments here mostly say this has nothing to do with AI.