Here's an analogy to help explain the skepticism: ants have amazing efficiency -...

klelatti · on June 22, 2020

> What I expect we'll see are ARM chips that are power and performance competitive with x86 chips only for specific curated use cases.

Sorry but there is no justification for this. With the same thermal constraints there is every expectation that an Apple / Arm CPU would be more performant and efficient than a comparable x86. Why? Because aarch64 doesn't have the historical legacy that x86 has and Apple has already shown what they can do in the iPad etc. Sure, the gains won't be triple digits, but they will be enough to be noticeable.

And, as you say, they will have the advantage of Apple's custom silicon for specific use cases. So best of both worlds.

AshamedCaptain · on June 22, 2020

The comparison is a bit unfair. x86 is like a decade older than ARM. Not that much in retrospect. aarch64 is as "free of historical legacy" as x86_64 is (that is: not at all free). There is a lot of cruft and even multiple ISAs in aarch64 (e.g. T32/Thumb).

And the CISC vs RISC arguments are questionable, seeing that Apple has done the migration in both directions by now.

tspike · on June 22, 2020

I noticed that Apple made absolutely no mention of ARM in their keynote. Seems like they're trying to whitelabel it for brand benefits as well as to divorce themselves from any expectations around standards?

klelatti · on June 22, 2020

That was interesting! Surely not an accident. Possibly to:

- Emphasise the breadth of their silicon expertise across CPU / GPU / Neural Engines etc.
- Because Arm has little or no brand recognition (Apple > Intel > Arm in branding terms).
- Distinguish from any me-too moves to Arm by competitors.

philistine · on June 23, 2020

There you go! Someone finally figured it out. Apple is moving to Apple Silicon, not ARM. Try to get LG to announce they’re offering a PC with Apple Silicon tomorrow.

The absence of any ARM mention is marketing, nothing more.

caf · on June 23, 2020

Which competitors? Windows and ChromeOS have already been sold on ARM hardware.

gsnedders · on June 23, 2020

They scarcely ever have with regards to iOS either; architecture has never been a talking point for their CPUs. How long was it from iPhone announcement to knowing it was ARM? How long from the Apple A4 announcement to knowing it was ARM?

stjohnswarts · on June 22, 2020

Yep, most of that old "cruft" is essentially unused and turned off. A lot of critics of x86 don't really know what they're talking about. x86 is inefficient because Intel and AMD don't really have much incentive to end the status quo, where performance is more important than power to most customers. People are happy with 3-4 hours out of their laptops, so Intel and AMD aim for that and sacrifice power for performance. Quite often that is the tradeoff in the design.

thaumasiotes · on June 23, 2020

> People are happy with 3-4 hours out of their laptops so Intel and AMD aim for that and sacrifice power for performance

Heck, I'm happy with 1 hour. I leave my laptop plugged in nearly 100% of the time. The point of the laptop is that it's easy to move, not that I want to use it while I'm in transit.

fauigerzigerk · on June 23, 2020

Intel missed the whole move of personal computing to mobile devices. Not missing out on that should have been incentive enough one might think.

klelatti · on June 22, 2020

Some is turned off but some still has to be dealt with (variable instruction lengths for example).

Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.

AnthonyMouse · on June 22, 2020

> Some is turned off but some still has to be dealt with (variable instruction lengths for example).

Modern x86_64 processors don't actually natively execute x86 instructions, they translate them into the instructions the hardware actually uses. The percentage of the die required to do that translation is small and immaterial.

> Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.

Intel didn't understand the market.

I recently bought a new phone. On paper it's twice as fast as my old phone. I imagine that's true but I can't tell any difference. Everything was sufficiently fast before and it still is. I never use my phone to do anything that needs an actually-fast CPU. I have no reason to pay additional money for a faster phone CPU. But I do notice how often I have to charge the battery.

These are not atypical purchasing criteria for mobile devices, but that's not the market Intel was chasing with their designs and pricing, so they failed. It's not because they couldn't make an x86 CPU for that market, it's because they didn't want to, because it's a lower margin commodity market.

sitkack · on June 22, 2020

Faster cpus become more power-efficient cpus because they can race to sleep. So you really do want to pay more for that cpu, but not for the compute performance but for the battery life.

https://en.wikichip.org/wiki/race-to-sleep

AnthonyMouse · on June 23, 2020

That's assuming the faster CPUs use the same amount of power. It's possible for a slower CPU to have better performance per watt. This is often exactly what happens when you limit clock speed -- performance goes down, performance per watt goes up.
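
To put numbers on the disagreement, here is a minimal sketch of the energy arithmetic over a fixed time window. Every figure in it (power draw, throughput, idle power) is made up for illustration and does not describe any real chip; `energy_joules` is a hypothetical helper. It only shows that "race to sleep" pays off when the faster chip's extra power is smaller than its speed-up, and backfires otherwise.

```python
def energy_joules(work_units, throughput, active_watts, idle_watts, window_s):
    """Energy used over a fixed window: run the task, then idle for the rest."""
    active_s = work_units / throughput
    if active_s > window_s:
        raise ValueError("task does not finish inside the window")
    return active_watts * active_s + idle_watts * (window_s - active_s)


# Hypothetical figures, chosen only to illustrate the two cases.
WORK, WINDOW, IDLE_W = 100.0, 10.0, 0.1

# Baseline: a slow chip that is busy for half the window.
slow = energy_joules(WORK, throughput=20.0, active_watts=2.0,
                     idle_watts=IDLE_W, window_s=WINDOW)

# Twice as fast at 1.5x the power: better performance per watt,
# so racing to sleep uses less energy than the baseline.
fast_efficient = energy_joules(WORK, throughput=40.0, active_watts=3.0,
                               idle_watts=IDLE_W, window_s=WINDOW)

# Twice as fast at 5x the power: worse performance per watt,
# so the slower chip uses less energy despite staying awake longer.
fast_hungry = energy_joules(WORK, throughput=40.0, active_watts=10.0,
                            idle_watts=IDLE_W, window_s=WINDOW)

print(slow, fast_efficient, fast_hungry)
```

With these assumed numbers the ordering is `fast_efficient < slow < fast_hungry`, so both comments can be right depending on how the extra speed is paid for.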

kllrnohj · on June 22, 2020

> Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.

They didn't fail because of performance, though, they failed because of app support & lack of a quality radio. The CPU performance & efficiency itself was otherwise fine. It wasn't always chart-topping good, but it wasn't bad either.

klelatti · on June 22, 2020

Agreed - CPUs (at the end at least) were fine. Also they were probably looking for bigger margins than were available.

General point is that I think that Arm has a small architectural advantage due to lack of cruft but that other factors are usually more important - e.g. the resources and quality of team behind implementation.

klelatti · on June 22, 2020

Sorry meant A64 rather than aarch64 as pretty sure that Apple hasn't supported 32 bit for a while now (so no T32 or Thumb) so the instruction set was announced in 2010 and definitely cleaner than x86.

Agreed that CISC vs RISC is very questionable by now.

KMag · on June 23, 2020

Provided that the software is correctly written, ARM's weaker memory model allows for more flexible instruction and I/O scheduling.

It seems most people feel that the DEC Alpha went too far in weakening the memory model to improve performance, but A64 seems to at least be near the sweet spot.

It's also not a huge amount of work that gets thrown away decoding x86 instructions in parallel, but there's non-zero overhead introduced by having the start location of the next instruction depend on what the current instruction is.
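
A rough sketch of that dependency, using a toy encoding rather than real x86 or A64: in the hypothetical variable-length format below, the first byte of each instruction stores its total length, so the start of instruction N+1 cannot be known before instruction N has been inspected. With a fixed width, every start offset is known up front. Real decoders work around this in hardware (for example by decoding speculatively at many offsets and discarding the wrong ones), which is the thrown-away work mentioned above; the code only illustrates the data dependency.

```python
def fixed_length_starts(code: bytes, width: int = 4) -> list[int]:
    """Fixed-width encoding: start offsets are independent of the contents."""
    return list(range(0, len(code), width))


def variable_length_starts(code: bytes) -> list[int]:
    """Toy variable-length encoding (NOT x86): byte 0 of each instruction
    holds that instruction's total length in bytes."""
    starts = []
    offset = 0
    while offset < len(code):
        length = code[offset]
        if length == 0 or offset + length > len(code):
            raise ValueError(f"malformed instruction at offset {offset}")
        starts.append(offset)
        # The next start depends on the instruction just examined,
        # so this loop is inherently sequential.
        offset += length
    return starts


if __name__ == "__main__":
    print(fixed_length_starts(bytes(8)))
    print(variable_length_starts(bytes([2, 0xAA, 1, 3, 0xBB, 0xCC])))
```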

saagarjha · on June 23, 2020

The weaker memory model also uncovers synchronization bugs that have been papered over by x86's stronger semantics ;)

ryanpetrich · on June 22, 2020

Apple's recent aarch64 implementations don't support any of the 32 bit ARM instruction sets, and aarch64 is a significant departure from armv7

fiblye · on June 23, 2020

Your wording makes it sound like ARM is still being used just for smaller devices and controllers with very well defined and limited uses. General purpose computing is already possible with iPads and iPhones. They're just artificially limited by the OS.

iDevices weren't really made with games in mind, but they can push out performance that beats handheld gaming devices. Artists (including myself) use iPads extensively and the response time with the Apple Pencil beats out just about anything else on the market. The only limiting factor is the tiny memory that limits the file size and layer limit on some programs. They're just fine for watching video, and even multitasking with a video playing while working on something else. This is on a tiny device with no active cooling and long battery life, beating out most laptops in the same price range.

I don't believe there is any curated use case. They're already more than capable of being general purpose computers. I mean, Apple is already openly advertising that they're making iPad OS more desktop-like and operable with mice and keyboards. Literally the only things holding them back are the OS and Apple's refusal to put some decent memory inside.

imtringued · on June 23, 2020

The OS is the curated use case. Multitasking is an afterthought. Once the OS is no longer "holding them back" the Apple chip will run into similar problems that Intel CPUs run into.

occamschainsaw · on June 23, 2020

Here's a counterexample (stretching your analogy a bit): Ants lifting the heaviest car in the world.

[1] https://www.top500.org/news/japan-captures-top500-crown-arm-... [2] https://news.ycombinator.com/item?id=23601098

eanzenberg · on June 22, 2020

They already have a mobile chip that is as fast as an active-thermally cooled notebook chip.

alkonaut · on June 22, 2020

Isn't that for specific benchmarks though such as some geekbench/specint or web browsing benchmarks? I worry about non-gpu floating point for example. There is so much hand-optimized AVX/SSE code out there in big apps.

klodolph · on June 22, 2020

There’s a fair bit of AVX/SSE code out there, but these days the vast bulk of AVX/SSE code is generated by the autovectorizer and that’s mostly going to work on NEON without a hitch. Clang enables the autovectorizer at -O2 by default.

I’d be interested in estimates of how much hand-written AVX/SSE your computer actually runs. The apps I’ve seen usually have a fairly small core of AVX/SSE code.

berkut · on June 23, 2020

They're admittedly not applications most users run every day, but much of the work in multimedia applications (audio processing, encoding, decoding) is done with hand-crafted intrinsics; the same goes for video stuff.

In an even more niche area (high-end VFX apps, like compositors, renderers) SSE/AVX intrinsics are used quite a bit in performance-critical parts of the code, and auto-vectorisers can't yet do as good a job (they're pretty useless at conditionals and masking).

saagarjha · on June 23, 2020

Even less esoteric: your libc likely has at least a half dozen vectorized functions for the mem* and str* functions.

wolf550e · on June 22, 2020

But is the bulk of AVX code by time spent running, code that was generated by autovectorizer? The SIMD in openssl and ffmpeg is written by hand. I bet the code that spends a lot of time on the CPU, especially the code that runs a lot while humans are waiting, is written by hand.

throwaway5792 · on June 23, 2020

Those should have AArch64 versions written. AArch64 is old now, it's not some niche architecture.

wolf550e · on June 23, 2020

Desktop productivity content creation apps have never before needed ARM versions, so many probably don't have ARM specific optimizations, and some probably have x86 specific code that is just enabled by default.

The memory model differences are going to be painful to debug, I think ("all-the-world's-a-VAX syndrome" is now "all the world's a Pentium/x86-64").

jl6 · on June 22, 2020

"As fast" on specific curated use cases. Show me an Apple chip that beats any laptop on 7zip.

tshaddox · on June 22, 2020

I don’t know about 7zip specifically, but the iPad Pro seems to beat even some MacBook Pros on some benchmarks.

https://www.macrumors.com/2020/05/12/ipad-pro-vs-macbook-air...

tw04 · on June 22, 2020

Given that Amazon was able to get there, what makes you think Apple can't? I would struggle to believe that Annapurna Labs has any significant advantage over PA Semi given the track record PA has had since joining Apple, and the fact they had nearly a decade head-start.

https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...

fock · on June 22, 2020

not sure, what they are measuring though - random Xeon from spec.org: https://www.spec.org/cpu2006/results/res2017q3/cpu2006-20170... ; at most 30% higher frequency, yet twice as fast. Well...

r00fus · on June 22, 2020

Is this a joke, what kind of usage benchmark is 7zipping large numbers of files?

ardy42 · on June 22, 2020

>>> They already have a mobile chip that is as fast as an active-thermally cooled notebook chip.

>> "As fast" on specific curated use cases. Show me an Apple chip that beats any laptop on 7zip.

> Is this a joke, what kind of usage benchmark is 7zipping large numbers of files?

A benchmark that Apple is unlikely to have implemented specific optimizations for, which therefore is a better test of the general purpose performance of the chip.

The situation being claimed here is sort of like if someone cited a DES benchmark to claim that Deep Crack's DES cracking chips (https://en.wikipedia.org/wiki/EFF_DES_cracker) were faster than a contemporary 1998 Pentium II.

r00fus · on June 22, 2020

A benchmark that probably relies as much on disk access speeds as CPU?

user5994461 · on June 22, 2020

Nope, 7zip uses the LZMA algorithm for compression, which runs at around a few MB/s on the fastest CPU. It's heavily CPU bound.

edit: Just tried compressing a large file, ultra setting on my desktop i5 CPU, it's running at 3 MB/s on 1 core.
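
For anyone who wants to try a similar check without touching the disk at all, here is a minimal sketch using Python's standard `lzma` module (liblzma, the same algorithm family as xz, not 7-Zip itself). The input is built in memory, so whatever throughput it reports is CPU time on a single thread. The helper names `lzma_throughput_mb_s` and `sample_data`, the half-random test input, and the default preset of 6 are all assumptions for illustration; numbers will vary by machine and by data.

```python
import lzma
import os
import time


def lzma_throughput_mb_s(data: bytes, preset: int = 6) -> float:
    """Compress `data` in memory on one thread and return input MB/s."""
    start = time.perf_counter()
    compressed = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    # Round-trip check so a broken run is not mistaken for a fast one.
    if lzma.decompress(compressed) != data:
        raise RuntimeError("round-trip mismatch")
    return len(data) / elapsed / 1e6


def sample_data(size: int = 1 << 20) -> bytes:
    """Hypothetical test input: half random bytes, half repetitive text."""
    half = size // 2
    line = b"the quick brown fox jumps over the lazy dog\n"
    text = (line * (half // len(line) + 1))[:half]
    return os.urandom(size - half) + text


if __name__ == "__main__":
    print(f"{lzma_throughput_mb_s(sample_data()):.1f} MB/s")
```

Higher presets (up to `9 | lzma.PRESET_EXTREME`) trade speed and memory for ratio, which is closer to an "ultra" setting but needs considerably more RAM.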

dwaite · on June 23, 2020

Apple ships a framework for doing lzma and other compression algorithms. I doubt they will be taken by surprise

r00fus · on June 22, 2020

One large file could be CPU bound, many small files (which is partly why you zip/jar things up) is disk bound.

stjohnswarts · on June 22, 2020

Not true. the bottleneck is going to be compression not disk access.

ardy42 · on June 22, 2020

> A benchmark that probably relies as much on disk access speeds as CPU?

If true, that's just a nitpick that doesn't affect the overall point of the GGGP, though.

AlexandrB · on June 22, 2020

I think the point is that it won’t be as fast in applications that Apple didn’t anticipate.

chipotle_coyote · on June 22, 2020

I believe it is what the poster you are replying to would call "a specific curated use case."

(Semi-seriously, I don't know anyone who uses a Unix(-like) system who uses 7zip, although I'm sure they're out there. For the record, I just unzipped a 120M archive on both my 2020 Core i7 MacBook Air and my 2018 (last-gen) iPad Pro and as near as I can tell the iPad was faster actually extracting the files, but had an extra second or so of overhead from the UI.)

ac29 · on June 22, 2020

> I don't know anyone who uses a Unix(-like) system who uses 7zip

7zip is an implementation of LZMA, like xz. So, different names and file format details, but essentially the same algorithm.

user5994461 · on June 22, 2020

Correct. 7zip is an LZMA compressor. The common equivalent command line tool on Linux is xz.

Linux distributions have been using xz compression for all packages (replacing gzip). So the answer to the question of how relevant xz/lzma/7zip performance is to day-to-day tasks is: very relevant.

The successor will probably be zstd in the coming years. https://www.phoronix.com/scan.php?page=news_item&px=Fedora-3...

rjsw · on June 23, 2020

You can find a 7zip for UNIX here [1].

[1] http://p7zip.sourceforge.net/

robocat · on June 22, 2020

Actually the zip algorithm is a perfect candidate for dedicated silicon or a dedicated instruction.

I read about a chip that had that feature yesterday but I can’t find the link unfortunately.

Sleaker · on June 22, 2020

this is something that's looking at getting moved to storage controllers on motherboards, ex: PS5/Xbox consoles so that compressed data can be streamed directly to the GPU. Hopefully we'll start to get this type of tech after it's been proven in the console space.

fomine3 · on June 23, 2020

Intel QAT supports gzip compression.

43920 · on June 22, 2020

How about JavaScript execution? https://twitter.com/dhh/status/1043277162676072449

fock · on June 22, 2020

so could you provide some actual, open-source basic benchmarks. And not strange, opaque geekbench results...

I guess AMD is fine for me (as is my old Intel-notebook) and I'll just wait for POVray, GROMACS and Co.

EDIT: And well, I noticed, supposedly Anandtech ran SPECint2006 on the A13 (and numerous other chips) - they ran it with WSL for x86 (because running on a dozen Android things is easier than running a standard benchmark on Linux/Windows ofc). You find the results here: https://images.anandtech.com/doci/14892/spec2006-global-over... - I guess (not sure, because it's for some reason not clearly marked and mentioned...) these are SPECint2006-results. So, let's check them for validity (because WSL is no problem and it matches Linux ofc); just looking at an i7-6700K (which is a little bit behind the i9-9900K they supposedly ran on): https://www.spec.org/cpu2006/results/res2016q1/cpu2006-20160... - marginally worse performance than @anandtech in some benchmarks, but that's with an older CPU and an older arch! And then there are the 3 or 4 benchmarks which are just way off. Makes one wonder, what they really did (because of course, installing CentOS and running SPEC on native Linux is too much of a hassle, when running and compiling on 8 ARM-platforms!?)

EDITEDIT: it's even worse for SPECfp2006: https://www.spec.org/cpu2006/results/res2016q1/cpu2006-20160... [well, here the old 6700K suddenly sometimes is twice as fast as the 9900K and 3x as fast as the A13 (and yeah the story of the 2.8GHz-low-power-chip drawing circles around a 4.5GHz-HF-part just didn't sound convincing in the first place...)]

Veedrac · on June 23, 2020

The official results from spec.org have a bunch of cheating, eg. exploiting undefined behaviour to run a benchmark improperly. AnandTech uses a consistent compiler (Clang, not ICC) without settings to exploit this, hence the divergence.

Andrei mentions this here: https://www.realworldtech.com/forum/?threadid=187314&curpost...

KMag · on June 23, 2020

When I started at Google, I sat next to a guy who used to write compilers for DEC and Intel. I asked him, given the huge amount Google spends on hardware and electricity, if he thought that switching to ICC was worth while. His answer was basically that ICC is tuned to maximally exploit undefined behavior for marketing purposes and he wouldn't want to use it in production, at least without heavily tweaking flags to disable some optimizations. ICC gets most of its speed advantages by enabling optimizations that are present in GCC/Clang, but deemed too dangerous to turn on by default.

fock · on June 23, 2020

clang with "-Ofast"; I really wonder, which other options are there?! And imho this just doesn't explain the 200% difference?

And some other things which are bugging me:

- can you only optimize for A53 on Android with the big.LITTLE-configurations (they do!)?

- so they cross-compiled an AMD64-producing gcc 3.2 with Xcode 10 on MacOS-X for ARM? impressive.

looping__lui · on June 22, 2020

Yeah, last time they did that with Nvidia graphics cards just as Adobe released its new rendering engine everybody was really thrilled to learn they could also buy Apple video editing software that would not sht itself instead of using the Adobe tools because of that inherent Apple advantage...

E.g., a nice article from 2010: https://nofilmschool.com/2010/07/apple-snubs-adobe-again-wit...

dzhiurgis · on June 22, 2020

Intel can't shrink the die size to what TSMC/Global Foundries/Samsung can and they will never let them manufacture due to IP/national security/etc reasons.

IOT_Apprentice · on June 22, 2020

Are you saying that Amazon AWS's Graviton2 EC2 servers can't handle server level performance?

jimnotgym · on June 22, 2020

I wonder how that will pan out for the people running MS Office all day on their Macs...