Here's an analogy to help explain the skepticism: ants have amazing efficiency - they can lift multiples of their own body weight. So why can't an ant lift my car? Well, because it's too small. So let's just take the same ant design and scale it up? Unfortunately, it doesn't work like that. A creature capable of lifting my car wouldn't be much like an ant.
There is no guarantee that a phone-scale CPU can just become 4x faster by 4x'ing the power/TDP/die area. If it were that easy, Intel would already have done it (and no, the x86 architecture isn't so terribly inefficient that they are leaving triple-digit percentage improvements on the table).
What I expect we'll see are ARM chips that are power and performance competitive with x86 chips only for specific curated use cases. Apple will extract an advantage by putting custom hardware acceleration into them, to cater for those specific tasks. They will not be able to achieve general purpose performance improvements wildly beyond what Intel can already do.
This is how the current iDevices achieve their excellent performance and battery life. Not through raw general-purpose CPU horsepower, but by a finely tuned synergy between hardware and software. Apple are taking their desktop down the same route. This will be the ultimate competitive advantage for their own software - they will be able to move key software components into hardware, and make it look like magic. But as a developer, you won't be able to participate in this unless you target Apple-blessed hardware instructions/APIs. Your Python script isn't going to start running 4x faster unless you can convince Apple to implement its inner loop in custom silicon.
I have no doubt that Apple will be leaning hard into ASIC territory as they build out their new CPUs. The endgame? Every software function you need, baked into perfectly optimised silicon by the monovendor.
> What I expect we'll see are ARM chips that are power and performance competitive with x86 chips only for specific curated use cases.
Sorry, but there is no justification for this. With the same thermal constraints there is every expectation that an Apple / Arm CPU would be more performant and efficient than a comparable x86. Why? Because aarch64 doesn't have the historical legacy that x86 has, and Apple has already shown what they can do in the iPad etc. Sure, the gains won't be triple digits, but they will be enough to be noticeable.
And, as you say, they will have the advantage of Apple's custom silicon for specific use cases. So best of both worlds.
The comparison is a bit unfair. x86 is like a decade older than ARM. Not that much in retrospect. aarch64 is as "free of historical legacy" as x86_64 is (that is: not at all free). There is a lot of cruft and even multiple ISAs in aarch64 (e.g. T32/Thumb).
And the CISC vs RISC arguments are questionable, seeing that Apple has done the migration in both directions by now.
I noticed that Apple made absolutely no mention of ARM in their keynote. Seems like they're trying to whitelabel it for brand benefits as well as to divorce themselves from any expectations around standards?
That was interesting! Surely not an accident. Possibly to:
- Emphasise the breadth of their silicon expertise across CPU / GPU / Neural Engines etc.
- Avoid leaning on Arm, which has little or no brand recognition (Apple > Intel > Arm in branding terms).
- Distinguish from any me-too moves to Arm by competitors.
There you go! Someone finally figured it out. Apple is moving to Apple Silicon, not ARM. Try to get LG to announce they’re offering a PC with Apple Silicon tomorrow.
The absence of any ARM mention is marketing, nothing more.
They scarcely ever have with regards to iOS either; architecture has never been a talking point for their CPUs. How long was it from iPhone announcement to knowing it was ARM? How long from the Apple A4 announcement to knowing it was ARM?
Yep, most of that old "cruft" is essentially unused and turned off. A lot of critics of x86 don't really know what they're talking about. x86 chips are inefficient because Intel and AMD don't have much incentive to end the status quo, where performance is more important than power to most customers. People are happy with 3-4 hours out of their laptops, so Intel and AMD aim for that and sacrifice power for performance; quite often that is the tradeoff in the design.
> People are happy with 3-4 hours out of their laptops so Intel and AMD aim for that and sacrifice power for performance
Heck, I'm happy with 1 hour. I leave my laptop plugged in nearly 100% of the time. The point of the laptop is that it's easy to move, not that I want to use it while I'm in transit.
> Some is turned off but some still has to be dealt with (variable instruction lengths for example).
Modern x86_64 processors don't actually natively execute x86 instructions, they translate them into the instructions the hardware actually uses. The percentage of the die required to do that translation is small and immaterial.
> Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.
Intel didn't understand the market.
I recently bought a new phone. On paper it's twice as fast as my old phone. I imagine that's true but I can't tell any difference. Everything was sufficiently fast before and it still is. I never use my phone to do anything that needs an actually-fast CPU. I have no reason to pay additional money for a faster phone CPU. But I do notice how often I have to charge the battery.
These are not atypical purchasing criteria for mobile devices, but that's not the market Intel was chasing with their designs and pricing, so they failed. It's not because they couldn't make an x86 CPU for that market, it's because they didn't want to, because it's a lower margin commodity market.
Faster CPUs become more power-efficient CPUs because they can race to sleep. So you really do want to pay more for that CPU: not for the compute performance, but for the battery life.
That's assuming the faster CPUs use the same amount of power. It's possible for a slower CPU to have better performance per watt. This is often exactly what happens when you limit clock speed -- performance goes down, performance per watt goes up.
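To make the two positions above concrete, here is a rough sketch of the energy arithmetic. Every number in it is invented for illustration, and the model itself (a fixed job, constant active power, constant idle power, a fixed time window) is an assumption made here, not a measurement of any real chip.

```python
def energy_joules(work_units, perf_units_per_s, active_watts, idle_watts, window_s):
    """Energy used over a fixed window: run the job flat out, then idle."""
    busy_s = work_units / perf_units_per_s
    if busy_s > window_s:
        raise ValueError("job does not finish inside the window")
    return active_watts * busy_s + idle_watts * (window_s - busy_s)

# Hypothetical chips, all given the same job (100 units) and the same 10 s window.
WORK, WINDOW, IDLE_W = 100, 10, 0.1

# Twice as fast at the same active power: racing to sleep wins.
fast_same_power = energy_joules(WORK, 50, 3.0, IDLE_W, WINDOW)
# Twice as fast, but at more than three times the active power.
fast_hungry = energy_joules(WORK, 50, 10.0, IDLE_W, WINDOW)
# The slower baseline chip.
slow = energy_joules(WORK, 25, 3.0, IDLE_W, WINDOW)

for name, joules in [("fast, same power", fast_same_power),
                     ("fast, hungry", fast_hungry),
                     ("slow", slow)]:
    print(f"{name:>16}: {joules:.1f} J")
```

With these made-up figures the fast chip only saves energy when its active power stays close to the slow chip's. Once the extra speed costs disproportionately more power, the slower chip finishes the same job on fewer joules, which is the performance-per-watt point.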
> Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.
They didn't fail because of performance, though, they failed because of app support & lack of a quality radio. The CPU performance & efficiency itself was otherwise fine. It wasn't always chart-topping good, but it wasn't bad either.
Agreed - CPUs (at the end at least) were fine. Also they were probably looking for bigger margins than were available.
General point is that I think that Arm has a small architectural advantage due to lack of cruft but that other factors are usually more important - e.g. the resources and quality of team behind implementation.
Sorry, I meant A64 rather than aarch64. I'm pretty sure Apple hasn't supported 32-bit for a while now (so no T32 or Thumb), so the relevant instruction set was announced in 2010 and is definitely cleaner than x86.
Agreed that CISC vs RISC is very questionable by now.
Provided that the software is correctly written, ARM's weaker memory model allows for more flexible instruction and I/O scheduling.
It seems most people feel that the DEC Alpha went too far in weakening the memory model to improve performance, but A64 seems to at least be near the sweet spot.
It's also not a huge amount of work that gets thrown away decoding x86 instructions in parallel, but there's non-zero overhead introduced by having the start location of the next instruction depend on what the current instruction is.
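The dependency described above can be sketched with a toy decoder. The encoding is made up for this example (the first byte of each instruction holds its total length); it is neither x86 nor A64, and real hardware decoders find boundaries speculatively and in parallel rather than in a loop like this.

```python
def fixed_width_starts(code: bytes, width: int = 4) -> list:
    """Fixed-width instructions: every start offset is known up front,
    so each one could be handed to a separate decoder immediately."""
    return list(range(0, len(code), width))

def variable_width_starts(code: bytes) -> list:
    """Toy variable-length encoding: byte 0 of each instruction is its length.
    The start of the next instruction is unknown until this one is examined."""
    starts = []
    offset = 0
    while offset < len(code):
        length = code[offset]
        if length == 0 or offset + length > len(code):
            raise ValueError(f"bad length byte at offset {offset}")
        starts.append(offset)
        offset += length  # serial dependency on the instruction just read
    return starts

# Hypothetical 8-byte program: instructions of length 2, 1, 3 and 2.
toy_program = bytes([2, 0xAA, 1, 3, 0xBB, 0xCC, 2, 0xDD])
print(fixed_width_starts(toy_program))
print(variable_width_starts(toy_program))
```

The fixed-width function never looks at the bytes at all, while the variable-width one has to walk them in order. That ordering constraint is the non-zero overhead being discussed, whatever its actual size on a real die.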
Your wording makes it sound like ARM is still being used just for smaller devices and controllers with very well defined and limited uses. General purpose computing is already possible with iPads and iPhones. They're just artificially limited by the OS.
iDevices weren't really made with games in mind, but they can push out performance that beats handheld gaming devices. Artists (including myself) use iPads extensively and the response time with the Apple Pencil beats out just about anything else on the market. The only limiting factor is the tiny memory that limits the file size and layer limit on some programs. They're just fine for watching video, and even multitasking with a video playing while working on something else. This is on a tiny device with no active cooling and long battery life, beating out most laptops in the same price range.
I don't believe there is any curated use case. They're already more than capable of being general purpose computers. I mean, Apple is already openly advertising that they're making iPad OS more desktop-like and operable with mice and keyboards. Literally the only things holding them back are the OS and Apple's refusal to put some decent memory inside.
The OS is the curated use case. Multitasking is an afterthought. Once the OS is no longer "holding them back" the Apple chip will run into similar problems that Intel CPUs run into.
Isn't that for specific benchmarks though such as some geekbench/specint or web browsing benchmarks? I worry about non-gpu floating point for example. There is so much hand-optimized AVX/SSE code out there in big apps.
There’s a fair bit of AVX/SSE code out there, but these days the vast bulk of AVX/SSE code is generated by the autovectorizer and that’s mostly going to work on NEON without a hitch. Clang enables the autovectorizer at -O2 by default.
I’d be interested in estimates of how much hand-written AVX/SSE your computer actually runs. The apps I’ve seen usually have a fairly small core of AVX/SSE code.
They're admittedly not applications most users run every day, but in many multimedia applications the audio processing, encoding and decoding is mostly done with hand-crafted intrinsics, and the same goes for video.
In an even more niche area (high-end VFX apps, like compositors, renderers) SSE/AVX intrinsics are used quite a bit in performance-critical parts of the code, and auto-vectorisers can't yet do as good a job (they're pretty useless at conditionals and masking).
But is the bulk of AVX code by time spent running, code that was generated by autovectorizer? The SIMD in openssl and ffmpeg is written by hand. I bet the code that spends a lot of time on the CPU, especially the code that runs a lot while humans are waiting, is written by hand.
Desktop productivity content creation apps have never before needed ARM versions, so many probably don't have ARM specific optimizations, and some probably have x86 specific code that is just enabled by default.
The memory model differences are going to be painful to debug, I think ("all-the-world's-a-VAX syndrome" is now "all the world's a Pentium/x86-64").
Given that Amazon was able to get there, what makes you think Apple can't? I would struggle to believe that Annapurna Labs has any significant advantage over PA Semi given the track record PA has had since joining Apple, and the fact they had nearly a decade head-start.
>>> They already have a mobile chip that is as fast as an active-thermally cooled notebook chip.
>> "As fast" on specific curated use cases. Show me an Apple chip that beats any laptop on 7zip.
> Is this a joke, what kind of usage benchmark is 7zipping large numbers of files?
A benchmark that Apple is unlikely to have implemented specific optimizations for, which therefore is a better test of the general purpose performance of the chip.
The situation being claimed here is sort of like if someone cited a DES benchmark to claim that Deep Crack's DES cracking chips (https://en.wikipedia.org/wiki/EFF_DES_cracker) were faster than a contemporary 1998 Pentium II.
I believe it is what the poster you are replying to would call "a specific curated use case."
(Semi-seriously, I don't know anyone who uses a Unix(-like) system who uses 7zip, although I'm sure they're out there. For the record, I just unzipped a 120M archive on both my 2020 Core i7 MacBook Air and my 2018 (last-gen) iPad Pro and as near as I can tell the iPad was faster actually extracting the files, but had an extra second or so of overhead from the UI.)
Correct. 7zip is a LZMA compressor. The common equivalent command line tool on Linux is xz.
Linux distributions have been using xz compression for all packages (replacing gzip). So to the question of how relevant xz/lzma/7zip performance is to day-to-day tasks, the answer is: very relevant.
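For anyone who wants a quick, open number to compare across machines, Python's standard `lzma` module produces and reads the same xz container format. What follows is a rough sketch rather than a proper benchmark: the payload, its size and the preset are arbitrary choices made here, and a single timed run is noisy.

```python
import lzma
import random
import time

def make_payload(size_bytes: int, seed: int = 0) -> bytes:
    """Build a repeatable, moderately compressible payload (arbitrary choice)."""
    rng = random.Random(seed)
    words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
    chunks = []
    total = 0
    while total < size_bytes:
        word = rng.choice(words) + b" "
        chunks.append(word)
        total += len(word)
    return b"".join(chunks)[:size_bytes]

def time_xz_roundtrip(payload: bytes, preset: int = 6):
    """Return (compress_seconds, decompress_seconds, compressed_size)."""
    t0 = time.perf_counter()
    packed = lzma.compress(payload, preset=preset)
    t1 = time.perf_counter()
    unpacked = lzma.decompress(packed)
    t2 = time.perf_counter()
    if unpacked != payload:
        raise RuntimeError("round trip did not reproduce the input")
    return t1 - t0, t2 - t1, len(packed)

payload = make_payload(1_000_000)
c_s, d_s, packed_size = time_xz_roundtrip(payload)
print(f"compress {c_s:.3f}s, decompress {d_s:.3f}s, "
      f"{len(payload)} -> {packed_size} bytes")
```

Because the payload is generated from a fixed seed, the same bytes are compressed on every machine, so differences in the timings come from the hardware and the Python/liblzma build rather than from the input.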
This is something that's looking at getting moved to storage controllers on motherboards, e.g. in the PS5/Xbox consoles, so that compressed data can be streamed directly to the GPU. Hopefully we'll start to get this type of tech after it's been proven in the console space.
so could you provide some actual, open-source basic benchmarks. And not strange, opaque geekbench results...
I guess AMD is fine for me (as is my old Intel-notebook) and I'll just wait for POVray, GROMACS and Co.
EDIT:
And well, I noticed, supposedly Anandtech ran SPECint2006 on the A13 (and numerous other chips) - they ran it with WSL for x86 (because running on a dozen Android things is easier than running a standard benchmark on Linux/Windows ofc). You can find the results here: https://images.anandtech.com/doci/14892/spec2006-global-over... - I guess (not sure, because it's for some reason not clearly marked and mentioned...) these are SPECint2006-results. So, let's check them for validity (because WSL is no problem and it matches Linux ofc); just looking at an i7-6700K (which is a little bit behind the i9-9900K they supposedly ran on):
https://www.spec.org/cpu2006/results/res2016q1/cpu2006-20160... - marginally worse performance than @anandtech in some benchmarks, but that's with an older CPU and an older arch! And then there are the 3 or 4 benchmarks which are just way off. Makes one wonder what they really did (because of course, installing CentOS and running SPEC on native Linux is too much of a hassle, when running and compiling on 8 ARM-platforms!?)
EDITEDIT: it's even worse for SPECfp2006: https://www.spec.org/cpu2006/results/res2016q1/cpu2006-20160... [well, here the old 6700K suddenly sometimes is twice as fast as the 9900K and 3x as fast as the A13 (and yeah the story of the 2.8GHz-low-power-chip drawing circles around a 4.5GHz-HF-part just didn't sound convincing in the first place...)]
The official results from spec.org have a bunch of cheating, eg. exploiting undefined behaviour to run a benchmark improperly. AnandTech uses a consistent compiler (Clang, not ICC) without settings to exploit this, hence the divergence.
When I started at Google, I sat next to a guy who used to write compilers for DEC and Intel. I asked him, given the huge amount Google spends on hardware and electricity, if he thought that switching to ICC was worth while. His answer was basically that ICC is tuned to maximally exploit undefined behavior for marketing purposes and he wouldn't want to use it in production, at least without heavily tweaking flags to disable some optimizations. ICC gets most of its speed advantages by enabling optimizations that are present in GCC/Clang, but deemed too dangerous to turn on by default.
Yeah, last time they did that with Nvidia graphics cards just as Adobe released its new rendering engine everybody was really thrilled to learn they could also buy Apple video editing software that would not sht itself instead of using the Adobe tools because of that inherent Apple advantage...
Intel can't shrink the die size to what TSMC/Global Foundries/Samsung can and they will never let them manufacture due to IP/national security/etc reasons.