
Here's an analogy to help explain the skepticism: ants have amazing efficiency - they can lift multiples of their own body weight. So why can't an ant lift my car? Well, because it's too small. So let's just take the same ant design and scale it up? Unfortunately, it doesn't work like that. A creature capable of lifting my car wouldn't be much like an ant.

There is no guarantee that a phone-scale CPU can just become 4x faster by 4x'ing the power/TDP/die area. If it were that easy, Intel would already have done it (and no, the x86 architecture isn't so terribly inefficient that they are leaving triple-digit percentage improvements on the table).

What I expect we'll see are ARM chips that are power and performance competitive with x86 chips only for specific curated use cases. Apple will extract an advantage by putting custom hardware acceleration into them, to cater for those specific tasks. They will not be able to achieve general purpose performance improvements wildly beyond what Intel can already do.

This is how the current iDevices achieve their excellent performance and battery life. Not through raw general-purpose CPU horsepower, but by a finely tuned synergy between hardware and software. Apple are taking their desktop down the same route. This will be the ultimate competitive advantage for their own software - they will be able to move key software components into hardware, and make it look like magic. But as a developer, you won't be able to participate in this unless you target Apple-blessed hardware instructions/APIs. Your Python script isn't going to start running 4x faster unless you can convince Apple to implement its inner loop in custom silicon.

I have no doubt that Apple will be leaning hard into ASIC territory as they build out their new CPUs. The endgame? Every software function you need, baked into perfectly optimised silicon by the monovendor.



> What I expect we'll see are ARM chips that are power and performance competitive with x86 chips only for specific curated use cases.

Sorry but there is no justification for this. With the same thermal constraints there is every expectation that an Apple / Arm CPU would be more performant and efficient than a comparable x86. Why? Because aarch64 doesn't have the historical legacy that x86 has and Apple has already shown what they can do in the iPad etc. Sure they won't be triple digits but it will be enough to be noticeable.

And, as you say, they will have the advantage of Apple's custom silicon for specific use cases. So best of both worlds.


The comparison is a bit unfair. x86 is like a decade older than ARM. Not that much in retrospect. aarch64 is as "free of historical legacy" as x86_64 is (that is: not at all free). There is a lot of cruft and even multiple ISAs in aarch64 (e.g. T32/Thumb).

And the CISC vs RISC arguments are questionable, seeing that Apple has done the migration in both directions by now.


I noticed that Apple made absolutely no mention of ARM in their keynote. Seems like they're trying to whitelabel it for brand benefits as well as to divorce themselves from any expectations around standards?


That was interesting! Surely not an accident. Possibly to:

- Emphasise the breadth of their silicon expertise across CPU / GPU / Neural Engines etc.
- Because Arm has little or no brand recognition (Apple > Intel > Arm in branding terms).
- Distinguish from any me-too moves to Arm by competitors.


There you go! Someone finally figured it out. Apple is moving to Apple Silicon, not ARM. Try to get LG to announce they’re offering a PC with Apple Silicon tomorrow.

The absence of any ARM mention is marketing, nothing more.


Which competitors? Windows and ChromeOS have already been sold on ARM hardware.


They scarcely ever have with regards to iOS either; architecture has never been a talking point for their CPUs. How long was it from iPhone announcement to knowing it was ARM? How long from the Apple A4 announcement to knowing it was ARM?


Yep, most of that old "cruft" is essentially unused and turned off. A lot of critics of x86 don't really know what they're talking about. x86 is inefficient because Intel and AMD don't really have much incentive to end the status quo where performance is more important than power to most customers. People are happy with 3-4 hours out of their laptops so Intel and AMD aim for that and sacrifice power for performance; quite often that is the tradeoff in the design.


> People are happy with 3-4 hours out of their laptops so Intel and AMD aim for that and sacrifice power for performance

Heck, I'm happy with 1 hour. I leave my laptop plugged in nearly 100% of the time. The point of the laptop is that it's easy to move, not that I want to use it while I'm in transit.


Intel missed the whole move of personal computing to mobile devices. Not missing out on that should have been incentive enough one might think.


Some is turned off but some still has to be dealt with (variable instruction lengths for example).

Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.


> Some is turned off but some still has to be dealt with (variable instruction lengths for example).

Modern x86_64 processors don't actually natively execute x86 instructions, they translate them into the instructions the hardware actually uses. The percentage of the die required to do that translation is small and immaterial.

> Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.

Intel didn't understand the market.

I recently bought a new phone. On paper it's twice as fast as my old phone. I imagine that's true but I can't tell any difference. Everything was sufficiently fast before and it still is. I never use my phone to do anything that needs an actually-fast CPU. I have no reason to pay additional money for a faster phone CPU. But I do notice how often I have to charge the battery.

These are not atypical purchasing criteria for mobile devices, but that's not the market Intel was chasing with their designs and pricing, so they failed. It's not because they couldn't make an x86 CPU for that market, it's because they didn't want to, because it's a lower margin commodity market.


Faster cpus become more power-efficient cpus because they can race to sleep. So you really do want to pay more for that cpu, but not for the compute performance but for the battery life.

https://en.wikichip.org/wiki/race-to-sleep


That's assuming the faster CPUs use the same amount of power. It's possible for a slower CPU to have better performance per watt. This is often exactly what happens when you limit clock speed -- performance goes down, performance per watt goes up.
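A rough way to put numbers on both sides of this: the sketch below computes the energy a chip uses over a fixed time window if it runs a task and then idles for the rest of the window. Every figure in it (work units, watts, window length) is made up for illustration, and `window_energy_joules` is a hypothetical helper, not a model of any real CPU.

```python
def window_energy_joules(work_units, units_per_second, active_watts,
                         idle_watts, window_seconds):
    """Energy over a fixed window: run the task flat out, then idle."""
    busy_seconds = work_units / units_per_second
    if busy_seconds > window_seconds:
        raise ValueError("task does not finish inside the window")
    idle_seconds = window_seconds - busy_seconds
    return busy_seconds * active_watts + idle_seconds * idle_watts


if __name__ == "__main__":
    # Hypothetical chips: (units per second, active watts).
    chips = {
        "slow, 1 W": (20, 1.0),
        "2x faster, same 1 W": (40, 1.0),   # the race-to-sleep case
        "2x faster, 4 W": (40, 4.0),        # faster but worse perf per watt
    }
    for name, (speed, watts) in chips.items():
        joules = window_energy_joules(100, speed, watts, 0.1, 10)
        print(f"{name}: {joules:.2f} J over the window")
```

With these assumed numbers, the faster chip only saves energy when its active power does not rise faster than its speed, which is the caveat in the comment above.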


> Intel tried to compete in mobile for a long time and failed even with a better manufacturing process.

They didn't fail because of performance, though, they failed because of app support & lack of a quality radio. The CPU performance & efficiency itself was otherwise fine. It wasn't always chart-topping good, but it wasn't bad either.


Agreed - CPUs (at the end at least) were fine. Also they were probably looking for bigger margins than were available.

General point is that I think that Arm has a small architectural advantage due to lack of cruft but that other factors are usually more important - e.g. the resources and quality of team behind implementation.


Sorry meant A64 rather than aarch64 as pretty sure that Apple hasn't supported 32 bit for a while now (so no T32 or Thumb) so the instruction set was announced in 2010 and definitely cleaner than x86.

Agreed that CISC vs RISC is very questionable by now.


Provided that the software is correctly written, ARM's weaker memory model allows for more flexible instruction and I/O scheduling.

It seems most people feel that the DEC Alpha went too far in weakening the memory model to improve performance, but A64 seems to at least be near the sweet spot.

It's also not a huge amount of work that gets thrown away decoding x86 instructions in parallel, but there's non-zero overhead introduced by having the start location of the next instruction depend on what the current instruction is.
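To make the "start of the next instruction depends on the current one" point concrete, here is a toy sketch. The encoding is invented for illustration (the first byte of each instruction holds its total length); it is not real x86 or A64 encoding, and the function names are hypothetical.

```python
def fixed_width_starts(code, width=4):
    """With fixed-width instructions every start offset is known up front,
    so the boundaries can be computed independently of the contents."""
    return list(range(0, len(code), width))


def variable_width_starts(code):
    """With variable-width instructions each start offset depends on the
    length of the previous instruction, so this walk is inherently serial."""
    starts = []
    offset = 0
    while offset < len(code):
        length = code[offset]  # toy rule: first byte is the instruction length
        if length == 0 or offset + length > len(code):
            raise ValueError(f"bad length {length} at offset {offset}")
        starts.append(offset)
        offset += length
    return starts


if __name__ == "__main__":
    toy_stream = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 1, 2, 3])
    print(variable_width_starts(toy_stream))
    print(fixed_width_starts(bytes(12)))
```

Real decoders work around the serial dependency by speculatively decoding at many offsets in parallel and discarding the wrong guesses, which is the thrown-away work mentioned above.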


The weaker memory model also uncovers synchronization bugs that have been papered over by x86's stronger semantics ;)


Apple's recent aarch64 implementations don't support any of the 32 bit ARM instruction sets, and aarch64 is a significant departure from armv7


Your wording makes it sound like ARM is still being used just for smaller devices and controllers with very well defined and limited uses. General purpose computing is already possible with iPads and iPhones. They're just artificially limited by the OS.

iDevices weren't really made with games in mind, but they can push out performance that beats handheld gaming devices. Artists (including myself) use iPads extensively and the response time with the Apple Pencil beats out just about anything else on the market. The only limiting factor is the tiny memory that limits the file size and layer limit on some programs. They're just fine for watching video, and even multitasking with a video playing while working on something else. This is on a tiny device with no active cooling and long battery life, beating out most laptops in the same price range.

I don't believe there is any curated use case. They're already more than capable of being general purpose computers. I mean, Apple is already openly advertising that they're making iPad OS more desktop-like and operable with mice and keyboards. Literally the only things holding them back are the OS and Apple's refusal to put some decent memory inside.


The OS is the curated use case. Multitasking is an afterthought. Once the OS is no longer "holding them back" the Apple chip will run into similar problems that Intel CPUs run into.


Here's a counterexample (stretching your analogy a bit): Ants lifting the heaviest car in the world.

[1] https://www.top500.org/news/japan-captures-top500-crown-arm-...
[2] https://news.ycombinator.com/item?id=23601098


They already have a mobile chip that is as fast as an active-thermally cooled notebook chip.


Isn't that for specific benchmarks though such as some geekbench/specint or web browsing benchmarks? I worry about non-gpu floating point for example. There is so much hand-optimized AVX/SSE code out there in big apps.


There’s a fair bit of AVX/SSE code out there, but these days the vast bulk of AVX/SSE code is generated by the autovectorizer and that’s mostly going to work on NEON without a hitch. Clang enables the autovectorizer at -O2 by default.

I’d be interested in estimates of how much hand-written AVX/SSE your computer actually runs. The apps I’ve seen usually have a fairly small core of AVX/SSE code.


They're admittedly not applications most users run every day, but much of the heavy lifting in multimedia applications (audio processing, encoding, decoding) is done with hand-crafted intrinsics; the same goes for video stuff.

In an even more niche area (high-end VFX apps, like compositors, renderers) SSE/AVX intrinsics are used quite a bit in performance-critical parts of the code, and auto-vectorisers can't yet do as good a job (they're pretty useless at conditionals and masking).


Even less esoteric: your libc likely has at least a half dozen vectorized functions for the mem* and str* functions.


But is the bulk of AVX code by time spent running, code that was generated by autovectorizer? The SIMD in openssl and ffmpeg is written by hand. I bet the code that spends a lot of time on the CPU, especially the code that runs a lot while humans are waiting, is written by hand.


Those should have AArch64 versions written. AArch64 is old now, it's not some niche architecture.


Desktop productivity content creation apps have never before needed ARM versions, so many probably don't have ARM specific optimizations, and some probably have x86 specific code that is just enabled by default.

The memory model differences are going to be painful to debug, I think ("all-the-world's-a-VAX syndrome" is now "all the world's a Pentium/x86-64").


"As fast" on specific curated use cases. Show me an Apple chip that beats any laptop on 7zip.


I don’t know about 7zip specifically, but the iPad Pro seems to beat even some MacBook Pros on some benchmarks.

https://www.macrumors.com/2020/05/12/ipad-pro-vs-macbook-air...


Given that Amazon was able to get there, what makes you think Apple can't? I would struggle to believe that Annapurna Labs has any significant advantage over PA Semi given the track record PA has had since joining Apple, and the fact they had nearly a decade head-start.

https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...


not sure, what they are measuring though - random Xeon from spec.org: https://www.spec.org/cpu2006/results/res2017q3/cpu2006-20170... ; at most 30% higher frequency, yet twice as fast. Well...


Is this a joke, what kind of usage benchmark is 7zipping large numbers of files?


>>> They already have a mobile chip that is as fast as an active-thermally cooled notebook chip.

>> "As fast" on specific curated use cases. Show me an Apple chip that beats any laptop on 7zip.

> Is this a joke, what kind of usage benchmark is 7zipping large numbers of files?

A benchmark that Apple is unlikely to have implemented specific optimizations for, which therefore is a better test of the general purpose performance of the chip.

The situation being claimed here is sort of like if someone cited a DES benchmark to claim that Deep Crack's DES cracking chips (https://en.wikipedia.org/wiki/EFF_DES_cracker) were faster than a contemporary 1998 Pentium II.


A benchmark that probably relies as much on disk access speeds as CPU?


Nope, 7zip uses the LZMA algorithm for compression, which runs at around a few MB/s on the fastest CPU. It's heavily CPU bound.

edit: Just tried compressing a large file, ultra setting on my desktop i5 CPU, it's running at 3 MB/s on 1 core.
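For anyone who wants to repeat this kind of measurement without touching the disk, here is a minimal sketch using the `lzma` module from the Python standard library. It compresses a buffer that is already in memory, so disk speed cannot be the bottleneck. The helper name, the sample data and the preset are all assumptions made for the example, and the result will not match 7zip's "ultra" setting exactly.

```python
import lzma
import time


def lzma_throughput_mb_s(data, preset=6):
    """Compress an in-memory buffer with LZMA and return input MB per second."""
    start = time.perf_counter()
    compressed = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    # Round-trip check so a broken run is not mistaken for a fast one.
    if lzma.decompress(compressed) != data:
        raise RuntimeError("round trip failed")
    return len(data) / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical sample: a few MB of repetitive text-like bytes.
    # Real files will compress at a different speed.
    sample = b"".join(b"line %d of some log file\n" % i for i in range(200_000))
    print(f"{lzma_throughput_mb_s(sample):.1f} MB/s on one core")
```

Only one core is used here, which matches the single-core figure quoted above; the absolute number depends heavily on the data and the preset.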


Apple ships a framework for doing lzma and other compression algorithms. I doubt they will be taken by surprise


One large file could be CPU bound, many small files (which is partly why you zip/jar things up) is disk bound.


Not true. The bottleneck is going to be compression, not disk access.


> A benchmark that probably relies as much on disk access speeds as CPU?

If true, that's just a nitpick that doesn't affect the overall point of the GGGP, though.


I think the point is that it won’t be as fast in applications that Apple didn’t anticipate.


I believe it is what the poster you are replying to would call "a specific curated use case."

(Semi-seriously, I don't know anyone who uses a Unix(-like) system who uses 7zip, although I'm sure they're out there. For the record, I just unzipped a 120M archive on both my 2020 Core i7 MacBook Air and my 2018 (last-gen) iPad Pro and as near as I can tell the iPad was faster actually extracting the files, but had an extra second or so of overhead from the UI.)


> I don't know anyone who uses a Unix(-like) system who uses 7zip

7zip is an implementation of LZMA, like xz. So, different names and file format details, but essentially the same algorithm.


Correct. 7zip is a LZMA compressor. The common equivalent command line tool on Linux is xz.

Linux distributions have been using xz compression for all packages (replacing gzip). So to the question of how relevant xz/lzma/7zip performance is to day-to-day tasks: it's very relevant.

The successor will probably be zstd in the coming years. https://www.phoronix.com/scan.php?page=news_item&px=Fedora-3...


You can find a 7zip for UNIX here [1].

[1] http://p7zip.sourceforge.net/


Actually the zip algorithm is a perfect candidate for dedicated silicon or a dedicated instruction.

I read about a chip that had that feature yesterday but I can’t find the link unfortunately.


this is something that's looking at getting moved to storage controllers on motherboards, ex: PS5/Xbox consoles so that compressed data can be streamed directly to the GPU. Hopefully we'll start to get this type of tech after it's been proven in the console space.


Intel QAT supports gzip compression.



so could you provide some actual, open-source basic benchmarks. And not strange, opaque geekbench results...

I guess AMD is fine for me (as is my old Intel-notebook) and I'll just wait for POVray, GROMACS and Co.

EDIT: And well, I noticed, supposedly Anandtech ran SPECint2006 on the A13 (and numerous other chips) - they ran it with WSL for x86 (because running on a dozen Android things is easier than running a standard benchmark on Linux/Windows ofc). You find the results here: https://images.anandtech.com/doci/14892/spec2006-global-over... - I guess (not sure, because it's for some reason not clearly marked and mentioned...) these are SPECint2006-results. So, let's check them for validity (because WSL is no problem and it matches Linux ofc); just looking at an i7-6700K (which is a little bit behind the i9-9900K they supposedly ran on): https://www.spec.org/cpu2006/results/res2016q1/cpu2006-20160... - marginally worse performance than @anandtech in some benchmarks, but that's with an older CPU and an older arch! And then there are the 3 or 4 benchmarks which are just way off. Makes one wonder, what they really did (because of course, installing CentOS and running SPEC on native Linux is too much of a hassle, when running and compiling on 8 ARM-platforms!?)

EDITEDIT: it's even worse for SPECfp2006: https://www.spec.org/cpu2006/results/res2016q1/cpu2006-20160... [well, here the old 6700K suddenly sometimes is twice as fast as the 9900K and 3x as fast as the A13 (and yeah the story of the 2.8GHz-low-power-chip drawing circles around a 4.5GHz-HF-part just didn't sound convincing in the first place...)]


The official results from spec.org have a bunch of cheating, eg. exploiting undefined behaviour to run a benchmark improperly. AnandTech uses a consistent compiler (Clang, not ICC) without settings to exploit this, hence the divergence.

Andrei mentions this here: https://www.realworldtech.com/forum/?threadid=187314&curpost...


When I started at Google, I sat next to a guy who used to write compilers for DEC and Intel. I asked him, given the huge amount Google spends on hardware and electricity, if he thought that switching to ICC was worth while. His answer was basically that ICC is tuned to maximally exploit undefined behavior for marketing purposes and he wouldn't want to use it in production, at least without heavily tweaking flags to disable some optimizations. ICC gets most of its speed advantages by enabling optimizations that are present in GCC/Clang, but deemed too dangerous to turn on by default.


clang with "-Ofast"; I really wonder, which other options are there?! And imho this just doesn't explain the 200% difference?

And some other things which are bugging me:

- can you only optimize for A53 on Android with the big.LITTLE-configurations (they do!)?

- so they cross-compiled an AMD64-producing gcc 3.2 with Xcode 10 on MacOS-X for ARM? impressive.


Yeah, last time they did that with Nvidia graphics cards just as Adobe released its new rendering engine everybody was really thrilled to learn they could also buy Apple video editing software that would not sht itself instead of using the Adobe tools because of that inherent Apple advantage...

E.g., a nice article from 2010: https://nofilmschool.com/2010/07/apple-snubs-adobe-again-wit...


Intel can't shrink the die size to what TSMC/Global Foundries/Samsung can and they will never let them manufacture due to IP/national security/etc reasons.


Are you saying that Amazon AWS's Graviton2 EC2 servers can't handle server level performance?


I wonder how that will pan out for the people running MS Office all day on their Macs...



