How and why CPUs do “branch prediction” (2017) (danluu.com)
192 points by majke on July 2, 2019 | hide | past | favorite | 33 comments


One fun thing to note is that the author doesn’t really talk a lot about anything from this decade, which makes sense considering how much complexity was being added for the tiny improvements towards the end. Modern branch predictors are quite complicated so they can eke out improvements in accuracy by fractions of a percent.


Improving CPUs since ~2003 has just been incredibly hard. My favorite satire-history account remains The Slow Winter: https://scholar.harvard.edu/files/mickens/files/theslowwinte...

> "I wonder if we can make branch predictors even more accurate,” and the next day you’d start XOR’ing the branch’s PC address with a shift register containing the branch’s recent branching history, because in those days, you could XOR anything with anything and get something useful, and you test the new branch predictor, and now you’re up to 96% accuracy, ...

> When you retire in 2003, your face is wrinkled from all of the smiles, and even though you’ve been sued by several pedestrians who suddenly acquired rare paintings as hats, you go out on top, the master of your domain. You look at your son John, who just joined Intel, and you rest well at night, knowing that he can look forward to a pliant universe and an easy life. Unfortunately for John, the branches made a pact with Satan and quantum mechanics during a midnight screening of “Weekend at Bernie’s II.”
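The XOR trick the satire describes is essentially the classic gshare predictor: XOR the branch's PC with a global history register and use the result to index a table of 2-bit saturating counters. A toy sketch (table size and history length here are arbitrary illustrative choices, not any real CPU's parameters):

```python
# Toy gshare branch predictor: XOR the branch PC with a global
# history register to index a table of 2-bit saturating counters.
class GsharePredictor:
    def __init__(self, table_bits=12, history_bits=12):
        self.index_mask = (1 << table_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0
        # 2-bit counters, initialized to "weakly taken" (2).
        self.counters = [2] * (1 << table_bits)

    def _index(self, pc):
        return (pc ^ self.history) & self.index_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

Because the history disambiguates different iterations of the same branch, even this toy version learns a "taken nine times, then not taken" loop branch almost perfectly after warm-up, which is roughly the 96%-accuracy regime the quote jokes about.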



In some/many cases there is code before the branch that does not influence the branch condition. In that case, would it be possible to architect the cpu such that it could execute the branch 'early'? I'm thinking of a special asm instruction like 'execute the next 4 instructions, and only then do the conditional jump'.

E.g. instead of:

  i=0
  StartLoop:
  i+=1
  do_stuff_0
  do_stuff_1
  do_stuff_2
  do_stuff_3
  if i<N goto StartLoop
we would have:

  i=0
  StartLoop:
  i+=1
  do_stuff_0
  do_stuff_1
  if i<N goto StartLoop in 2 instructions
  do_stuff_2
  do_stuff_3


Yes, that was made explicit in the MIPS instruction set with branch (and load) delay slots, and it's implicit in out-of-order processors. As I understand it the branch delay slot paradigm did not pan out as well as was hoped and it has not found its way into many other architectures.


The issue with the branch delay slot is that it assumes there is exactly one clock cycle (i.e. one pipeline stage) between fetch and execute. That was true on some of the early five-stage in-order RISCs, but it hasn't been true for a while. In a high-frequency in-order design there may be a dozen stages/clock cycles between fetch and execute, which would be very hard to fill with delay slots. OoO is even more complicated.


That’s called delay slots. It turns out to be so hard to work out what to put in them that they’re basically useless. Look at the massive flop that was the Itanium before you try it again!


ARMv8-M has a set of extensions for this kind of branch: there are prepare-to-branch instructions as well as counted-loop instructions. Because the delay slot is effectively variable-length, they don't offer very many options, and to handle interrupts you still need an ordinary branch instruction at the launch point. So if you take an interrupt in the middle of the sequence, the ordinary branch instruction still gets executed.

  prepare-to-branch <jump-target> <launch-label>
  ...
  launch-label:
  branch <jump-target>

The utility is pretty limited, but it can help for strictly in-order machines.


Wouldn't that be equivalent to loading i and N into registers, so that a conditional jump would be fast?

  i=0
  StartLoop:
  i+=1
  do_stuff_0
  do_stuff_1
  REG1 = i, REG2 = N
  do_stuff_2
  do_stuff_3
  if REG1<REG2 goto StartLoop


Really great article, love it: super clear and concise. These kinds of mechanisms are super interesting to learn more about. I'm curious about the new AMD CPUs that use a neural network to predict branches, how that fits into the performance picture, and what kind of accuracy it peaks at.


A brief introduction can be found in this post here:

https://chasethedevil.github.io/post/the_neural_network_in_y...

It talks about “Dynamic Branch Prediction with Perceptrons” (Jimenez 2001) which sparked the whole thing.

The AMD engineers read this paper (and probably many more!), put together a development program, and shipped it.
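The mechanism in the Jiménez & Lin paper is remarkably small: keep a vector of little integer weights, take their dot product with the recent branch history (encoded as ±1), and predict taken when the sum is non-negative; train only on a misprediction or a low-confidence correct prediction. A toy single-perceptron sketch (real designs index a whole table of perceptrons by PC, and the parameters here are just illustrative):

```python
# Toy perceptron branch predictor in the spirit of Jimenez & Lin (2001).
# History bits are encoded as +1 (taken) / -1 (not taken).
class PerceptronPredictor:
    def __init__(self, history_len=16):
        self.history = [1] * history_len
        # w[0] is the bias weight; the rest pair with history bits.
        self.w = [0] * (history_len + 1)
        # Training threshold, using the paper's heuristic 1.93*h + 14.
        self.threshold = int(1.93 * history_len + 14)

    def _output(self):
        return self.w[0] + sum(wi * hi for wi, hi in zip(self.w[1:], self.history))

    def predict(self):
        return self._output() >= 0  # True = predict taken

    def update(self, taken):
        y = self._output()
        t = 1 if taken else -1
        # Train on a misprediction or a low-confidence correct prediction.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            self.w[0] += t
            for i, hi in enumerate(self.history):
                self.w[i + 1] += t * hi
        self.history = self.history[1:] + [t]
```

The appeal is that accuracy grows with history length at linear hardware cost, where a table of counters would grow exponentially; the catch, as discussed downthread, is latency, since the dot product has to finish extremely fast.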


The newest one moved to TAGE


Reading these makes me feel like I don't know much about programming :D


Ha, ha :-) A lot of us are in the same boat, except that few are willing to acknowledge it openly. That said, I can suggest the following for further reading (from simple to advanced):

a) Eben Upton's write-up on "Raspberry Pi and Spectre/Meltdown" gives a very nice overview of the main features of modern processors - https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulne...

b) Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture by Jon Stokes.

c) Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.

d) Modern Processor Design: Fundamentals of Superscalar Processors by Shen and Lipasti


Programming, program execution, and PC architecture, even though very closely related, don't all have to be understood, depending on your goals. It's super interesting to learn these kinds of things if you are a programmer, though: it helps a lot, especially in low-level code, to improve performance and to have a better sense of what the CPU and memory are doing as a result of your code.


Regarding fixed precision neural networks, it looks like Intel had some fairly mature work on this at ISCA -- fast inference, but have to train elsewhere in the machine:

https://arxiv.org/abs/1906.09889


This is a good primer for understanding the Spectre[0] issue. In a nutshell, Spectre is caused by the CPU keeping the branches in an unencrypted cache that is accessible from programs other than the one that generated the branch.

Their fix was to add an extra kernel check for unauthorized access, which is why it causes a huge performance hit. The real fix will be in the next generation of chips, which will either encrypt or at the least protect the branch cache from unauthorized access from other programs.

[0] https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...


> In a nutshell, Spectre is caused by the CPU keeping the branches in an unencrypted cache that is accessible from programs other than the one that generated the branch.

I would instead say that the branch prediction has observable side effects even when it’s thrown away; say, for example, the branch performs a memory access: it may leave things in the cache even if the processor doesn’t take the branch and hence it may be possible to recover information by running a timing attack on how long it takes to access certain things.
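That leakage mechanism can be sketched with a toy simulation: model the cache as a set of "hot" lines, let a speculatively executed (and later squashed) load touch a secret-dependent line, then recover the secret by timing which probe line is fast. This is purely a model of the mechanism, with made-up addresses and latencies, not a working exploit:

```python
# Toy model of a Spectre-style cache side channel. The architectural
# result of the speculative load is discarded, but the cache state is not.
class ToyCache:
    def __init__(self):
        self.hot = set()  # addresses currently cached

    def access(self, addr):
        # Return a "latency": fast if the line is cached, slow if not.
        latency = 1 if addr in self.hot else 100
        self.hot.add(addr)
        return latency

def victim_speculative_load(cache, secret):
    # A mispredicted branch speculatively loads probe[secret]. The
    # register result is squashed, but the cache line stays resident.
    cache.access(0x1000 + secret * 64)  # one 64-byte line per value

def attacker_recover(cache):
    # Flush+reload-style probe: the one fast line reveals the secret.
    timings = {v: cache.access(0x1000 + v * 64) for v in range(256)}
    return min(timings, key=timings.get)

cache = ToyCache()
victim_speculative_load(cache, secret=42)
assert attacker_recover(cache) == 42
```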


I feel like fixed-precision neural nets are probably the future of branch prediction: both "hardcoding" a neural net trained on a large corpus of software, with the weights baked into the CPU in a ROM, and dynamic approaches that decide which prediction strategy to use for a specific branch based on state recorded by another neural net.

Approaches like this have been used as a fun exercise, but now I think we're reaching the point where single-thread performance is so important, and mispredicts are getting so expensive, that it's worth taking quite a big power and area hit to get a few more percentage points of performance...


> but now I think we're reaching the point where single thread performance is so important

Ever since Dennard scaling stopped, the biggest boosts to performance have come from increasing the parallelization opportunity, both at the SIMD level and at the multi-core level. Admittedly, working in HPC gives me a biased view, but everyone I see has resigned themselves to processors ceasing to give meaningful boosts to single-threaded performance.

Moreover, the ceiling on single-threaded performance you could get from a perfect branch predictor (for conditional branches) over the current state of the art is around 4%. There's just not a lot of upside to be had, and even just using the extra space for another core looks preferable at such a low ceiling. Extra area for the branch predictor is likely to go to better indirect branch prediction (including virtual function calls), which is increasingly important in modern workloads, and where there is much larger headroom for performance improvement.

I'll also add that the effectiveness of neural nets in modern machine learning contexts has come from making them big and deep, and deep means adding latency to the branch predictor, which is not what you want (especially because you want to know ASAP if you need to prefetch a different I-cache line).


Also, branch prediction circuitry has costs: higher energy per instruction on average, management of the design complexity and the risks associated with it (e.g. Spectre, Meltdown), increased die size and therefore a higher faulty-die rate and die cost, etc.


Thing is, not everything is amenable to SIMD or multithreaded optimization. As you said yourself, HPC code tends to parallelize more easily than average. For a lot of code, out-of-order execution is the only way to extract some amount of parallelism, and one of the biggest limiting factors on increasing the reorder window is the branch prediction rate. Even an improvement of 0.1% in the prediction rate can lead to non-trivial improvements in performance.


> For a lot of code out of order execution is the only way to extract some amount of parallelism

But that's the thing. A CPU is simply a parallel machine forced to accelerate "sequential" code.

A "truly sequential" code sequence, like linkedList->head->next->next->next cannot be parallelized on a modern CPU. The only stuff that can be parallelized are if-statements / loops (aka: branch prediction: try to do the future speculatively), and anything Tomasulo's algorithm happens to pick up.

Even then: the modern CPU will attempt to parallelize that linked-list access because of Cache prefetching. That's actually why Arrays work so well in today's architectures: because the L1, L2, L3, and DRAM components are working in parallel through Cache Prefetchers (and Arrays are so simple that speculation will certainly be correct).

In effect: people are writing parallel programs. They are just leaving it to the CPU to figure out the parallelism details.

X = Y + Z; A = B + C can be made parallel (independent variables).

X = Y + Z; X = D + E. Also parallel: Write-after-write hazard, so X=D+E can execute first, and just throw away X=Y+Z entirely. Etc. etc.

CPUs simply find these patterns in your code and execute them in parallel.
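The pattern-finding described above is classic data-hazard analysis, and it can be written down directly: treat each instruction as a destination plus sources and check the three dependence kinds. A toy checker (illustrative only; real hardware does this over renamed physical registers, many instructions at a time):

```python
# Toy data-hazard check between two instructions, each (dest, src, src).
# RAW: i2 reads i1's dest (a true dependence that forces ordering).
# WAR: i2 writes a register i1 reads; WAW: both write the same dest.
# WAR and WAW are "false" dependences that register renaming removes.
def hazards(i1, i2):
    d1, reads1 = i1[0], set(i1[1:])
    d2, reads2 = i2[0], set(i2[1:])
    return {
        "RAW": d1 in reads2,
        "WAR": d2 in reads1,
        "WAW": d1 == d2,
    }

# X = Y + Z ; A = B + C  ->  no hazards at all: fully parallel.
print(hazards(("X", "Y", "Z"), ("A", "B", "C")))
# X = Y + Z ; X = D + E  ->  WAW only; renaming removes it, so both can
# issue in parallel and the first result is simply never read.
print(hazards(("X", "Y", "Z"), ("X", "D", "E")))
```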

AMD Zen has 4-integer pipelines + 4-vector pipelines + 2 load/store units per core.

Intel has 8 pipelines (ports) of varying capabilities per core. Ports 0, 1, and 5 are vector units, but port 0 is also the division pipeline. It's a bit more complicated than that, but it's still 8x parallelism being fed by the front-end (and reordered into the correct order by the back-end).

--------------

Basically: the parallelism does exist in the code that was written. The programmer just hasn't explicitly acknowledged it yet.

But with every "speculative" branch taken, the CPU does work and then (maybe) throws it away. The GPU approach to coding is to not do any speculative work at all. Instead, GPUs throw ~10 threads per shader (kinda like a core) and SMT the heck out of them.

If thread#0 accesses memory, it will be stuck doing that for 300+ cycles, so thread#1 will take over the core. When thread#1 waits on memory, thread#2 takes over, etc. By the time thread#9 and #10 roll around, thread#0 probably has its memory ready.

Instead of speculatively executing thread#0, the GPU is provided with alternative work it can do. In any HPC context there's probably "more work to do" somewhere, so the GPU can remain saturated with work. It's naturally a more power-efficient methodology: there's less wasted work.

It just requires the programmer to figure out a lot of tasks the GPU can do to remain saturated, while a CPU will dedicate more and more resources, all the way up to speculative execution, to accelerating the same task, even if that's highly wasteful.
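The latency-hiding arithmetic above fits in one line: if each thread computes for C cycles and then stalls for L cycles on memory, the core stays busy once it has roughly 1 + L/C threads to rotate through. A quick back-of-the-envelope model using the 300-cycle figure (the 30-cycle compute burst is an assumed number for illustration):

```python
# Toy model of GPU-style latency hiding: each thread alternates C cycles
# of compute with an L-cycle memory stall. Utilization is the fraction
# of cycles on which some thread is ready to run.
def utilization(threads, compute=30, mem_latency=300):
    return min(1.0, threads * compute / (compute + mem_latency))

for n in (1, 4, 11):
    print(n, round(utilization(n), 2))
```

With these numbers a single thread keeps the core under 10% busy, while ~11 threads saturate it, which is why the "thread#9 and #10" rotation in the comment works out.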


> Admittedly, working in HPC gives me a biased view, but everyone I see has resigned themselves to processors ceasing to give meaningful boosts to single-threaded performance.

There's also the realization that compiler and software development are still far, far behind the parallelization capabilities of the hardware: see the failed DEC Alpha and Itanium. It's a bad design situation when compilers regularly generate code that blows caches, branch predictors, TLBs, and instruction pipelines more than 50% of the time, even after aggressive non-PGO tuning. Compilers need to get better at generating better code.


> Moreover, the ceiling you could get in boosting single-threaded performance with a perfect branch predictor (for conditional branches) over current state-of-the-art is around 4%.

No, the lack of perfect branch prediction is actually a major blocker on performance even now, since it limits the useful size of the out-of-order reorder buffer.


There has been some work in making cores wider with larger re-order depth and more physical register files, etc. But all of that is deeply into diminishing returns territory.


It seems like AMD has been doing something similar for a little while already: https://chasethedevil.github.io/post/the_neural_network_in_y...


AMD switched to TAGE for Zen 2, so I don't know if neural networks are "the future of branch prediction" or just a neat diversion.


Future? Most modern CPUs do this to some extent. Remember you can only have a very simple model here, because the inference time has to be extremely fast.


What's the state of the art of indirect jump prediction?



Thank you!


If you want to see other fun uses of (IT)TAGE, I'd recommend reading Arthur Perais and André Seznec's work on value prediction: https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...



