Into the Core: Intel's next-generation microarchitecture

Wednesday, April 05, 2006

The front end: branch prediction

For reasons of both performance and power efficiency, one of the places where Intel spent a ton of transistors was on Core's branch predictor.

As the distance (in CPU cycles) between main memory and the CPU increases, putting precious transistor resources into branch prediction hardware continues to give an ever-larger return on investment. This is because when a branch is mispredicted, it takes a relative eternity to retrieve the correct branch target from main memory; during this lengthy waiting period, a single-threaded processor must sit idle, wasting execution resources and power. So good branch prediction isn't just a matter of performance; it's also a matter of conserving power by making the most efficient possible use of processor cycles.

Core essentially uses the same three-part branch predictor developed for the Pentium M. I've previously covered the Pentium M's branch predictor in some detail, so I'll just summarize the features here.

At the heart of Core's branch prediction hardware is a pair of predictors, one bimodal and one global, that record information about the most recently executed branches. These predictors tell the front end how likely the branch is to be taken based on its past execution history. If the front end decides that the branch is taken, it retrieves the branch's target address from the branch target buffer (BTB) and begins fetching instructions from the new location.
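To make the bimodal/global distinction concrete, here's a minimal software sketch of the two textbook schemes those names usually refer to. The 2-bit saturating counters, the gshare-style XOR indexing, the table sizes, and every name in the code (BimodalPredictor, GlobalPredictor, next_fetch_address, accuracy) are my assumptions for illustration; Intel hasn't described Core's actual tables in this much detail.

```python
class BimodalPredictor:
    """One 2-bit saturating counter per table slot, indexed by branch address."""

    def __init__(self, table_bits=10):  # table size is an arbitrary assumption
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)  # 0..3; 1 = weakly not taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        step = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))


class GlobalPredictor:
    """gshare-style: counters indexed by branch address XOR recent outcomes."""

    def __init__(self, history_bits=8):  # history length is an assumption
        self.mask = (1 << history_bits) - 1
        self.history = 0  # most recent branch outcome lives in the low bit
        self.counters = [1] * (1 << history_bits)

    def predict(self, pc):
        return self.counters[(pc ^ self.history) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc ^ self.history) & self.mask
        step = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def next_fetch_address(pc, predictor, btb, fall_through):
    """Return the BTB target on a predicted-taken branch, else fall through."""
    if predictor.predict(pc) and pc in btb:
        return btb[pc]
    return fall_through


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, training as we go."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)
```

Feeding both models a branch that strictly alternates between taken and not taken (for example, accuracy(GlobalPredictor(), 0x40, [True, False] * 100)) shows why a design would want both: the per-branch counter keeps flip-flopping, while the history-indexed table learns the pattern after a short warm-up.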

Core's bimodal and global predictors aren't the only branch prediction structures that help the processor decide if a branch is taken or not taken. The new architecture also uses two other branch predictors that were first introduced with the Pentium M: the loop detector and the indirect branch predictor.

The loop detector

Loop exit branches are taken only once (when the loop terminates), which means that they're first not taken a set number of times (i.e., once for each iteration that the loop counter specifies). The branch history tables used in normal branch predictors don't store enough branch history to be able to correctly predict loop termination for loops beyond a certain number of iterations, so when the loop terminates, they mispredict that it will keep going based on its past behavior.

The loop detector monitors the behavior of each branch that the processor executes in order to identify which of those branches are loop exit conditions. When a branch is identified as a loop exit, a special set of counters is then used to track the number of loop iterations for future reference. When the front end next encounters that same loop exit branch, it knows exactly how many times the loop is likely to iterate before terminating. Thus it's able to correctly predict the outcome of that branch with 100 percent accuracy in situations where the loop runs for the same number of iterations each time.
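The counting idea described above can be sketched in a few lines. This is only a toy model under my own assumptions: the per-branch dictionary, the rule that a trip count must be seen twice in a row before the detector trusts it, and the names LoopDetector and run_loop are all hypothetical, not a description of Core's hardware.

```python
class LoopDetector:
    """Toy loop detector: learns how many times a loop exit branch
    falls through before it finally exits."""

    def __init__(self):
        self.entries = {}  # branch address -> per-branch counters

    def predict_exit(self, pc):
        """True/False if the detector has an opinion, None if it doesn't."""
        e = self.entries.get(pc)
        if e is None or not e["confident"]:
            return None  # defer to the other predictors
        return e["current"] == e["trip_count"]

    def update(self, pc, exited):
        e = self.entries.setdefault(
            pc, {"current": 0, "trip_count": None, "confident": False}
        )
        if exited:
            # Trust the count only if it matches the previous run (assumption).
            e["confident"] = e["trip_count"] == e["current"]
            e["trip_count"] = e["current"]
            e["current"] = 0
        else:
            e["current"] += 1


def run_loop(detector, pc, trips):
    """Run one loop of `trips` iterations; return how many of the
    trips + 1 exit-branch executions were predicted correctly."""
    correct = 0
    for i in range(trips + 1):
        exited = i == trips
        if detector.predict_exit(pc) == exited:
            correct += 1
        detector.update(pc, exited)
    return correct
```

In this model the detector stays silent for the first two runs of a loop while it learns the trip count, then predicts every execution of the exit branch correctly for as long as the count stays the same. If the count changes, it mispredicts and goes back to learning.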

Core's branch prediction unit (BPU) uses an algorithm to select on a branch-by-branch basis which of the branch predictors described so far (bimodal, global, loop detector) should be used for each branch.
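Intel hasn't said what that selection algorithm is. One well-known textbook approach is a tournament-style chooser, sketched below purely as an illustration: a small counter per branch remembers which of the two history-based predictors has been right more often, and a confident loop detector overrides both. The Chooser class, its table size, and the priority given to the loop prediction are all my assumptions.

```python
class Chooser:
    """Tournament-style selector (a textbook scheme, not Intel's documented one)."""

    def __init__(self, table_bits=10):  # table size is an arbitrary assumption
        self.mask = (1 << table_bits) - 1
        # 2-bit counter per slot; >= 2 means "trust the global predictor".
        self.trust_global = [1] * (1 << table_bits)

    def select(self, pc, bimodal_pred, global_pred, loop_pred=None):
        """Pick one prediction for the branch at `pc`."""
        if loop_pred is not None:
            return loop_pred  # assumed priority: a confident loop detector wins
        if self.trust_global[pc & self.mask] >= 2:
            return global_pred
        return bimodal_pred

    def update(self, pc, bimodal_pred, global_pred, taken):
        """Nudge the counter toward whichever predictor was right."""
        if bimodal_pred == global_pred:
            return  # nothing to learn when the two agree
        i = pc & self.mask
        step = 1 if global_pred == taken else -1
        self.trust_global[i] = max(0, min(3, self.trust_global[i] + step))
```

The chooser only learns on disagreements, so a branch that both predictors handle well never disturbs its counter.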

The indirect branch predictor

Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they're notoriously difficult to predict. Core's indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.
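Here's a rough sketch of how a history-based target table might work. To keep it short, the table stores target addresses directly instead of pointing into a BTB, and it is keyed on the branch address plus the last few targets seen; the key format, the history length, and the IndirectPredictor name are assumptions of mine, since the real table's organization isn't public.

```python
class IndirectPredictor:
    """Toy indirect branch predictor keyed on (branch address, recent targets)."""

    def __init__(self, history_len=4):  # history length is an assumption
        self.history_len = history_len
        self.history = ()  # the most recent indirect targets, oldest first
        self.table = {}    # (pc, history) -> target seen in that context
        self.btb = {}      # pc -> last target, used as a fallback

    def predict(self, pc):
        """Prefer the history-specific target; fall back to the last target.
        Returns None if the branch has never been seen."""
        return self.table.get((pc, self.history), self.btb.get(pc))

    def update(self, pc, target):
        self.table[(pc, self.history)] = target
        self.btb[pc] = target
        # Slide the window of recent targets forward by one.
        self.history = (self.history + (target,))[-self.history_len:]
```

A branch that cycles through three targets in a fixed order (think of an interpreter's dispatch jump) defeats a plain last-target BTB, because the target changes on every execution. In this sketch the recent-target history is enough to tell the three cases apart once each has been seen.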
