Into the Core: Intel's next-generation microarchitecture

By Jon "Hannibal" Stokes

Memory disambiguation: the data stream version of speculative execution

There's a simple reason why out-of-order processors must first put instructions back in program order before officially writing their results out to some form of programmer-visible memory (the register file or main memory): you can't modify a memory location until you're sure that all of the previous instructions that read that location have completed execution.

Consider the code fragment in the diagram below. The first line stores the number 13 to a memory cell whose address is not yet known, and the next line loads the contents of the red memory cell into register A. The final line is an arithmetic instruction that adds the contents of registers A and B, and places the result in register C.


Memory aliasing

The blocks marked "A" and "B" below the code fragment show two options for the destination address of the store: either the red cell, or an unrelated blue cell. If the store ends up writing to the red cell (option A), then the store must execute before the load so that the load can then read the updated value from the red cell and supply it to the following addition instruction (via register A). If the store writes its value to the blue cell (option B), then it doesn't really matter if that store executes before or after the load, because it's modifying an unrelated memory location.

When a store and a load both access the same memory address, the two instructions are said to alias. So option A above is an example of memory aliasing, while option B is not.
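To make the two options concrete, here's a minimal Python sketch (the addresses 0x100 for the red cell and 0x200 for the blue cell are made up for illustration): swapping the order of the store and the load changes the loaded value only in the aliased case.

```python
# Illustrative sketch, not real hardware: a store and a load alias when
# they touch the same address, and only then does their relative order
# change the load's result.
mem = {0x100: 0, 0x200: 0}   # red cell = 0x100, blue cell = 0x200

def run(store_addr, order):
    m = dict(mem)
    a = None
    ops = [("store", store_addr, 13), ("load", 0x100, None)]
    if order == "load-first":
        ops.reverse()
    for op, addr, val in ops:
        if op == "store":
            m[addr] = val        # store writes its value to memory
        else:
            a = m[addr]          # load reads the red cell into "register A"
    return a

# Option A: store writes the red cell -- order matters.
assert run(0x100, "store-first") == 13
assert run(0x100, "load-first") == 0
# Option B: store writes the blue cell -- order is irrelevant.
assert run(0x200, "store-first") == run(0x200, "load-first") == 0
```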

David Kanter's RWT article on Core cites research showing that over 97 percent of the memory accesses in a processor's instruction window fall into the B category, where the memory accesses are to unrelated locations and therefore could theoretically proceed independently of one another. But for the sake of the remaining 3 percent of aliased memory accesses, processors like the P6 and Pentium 4 are built around a conservative set of assumptions about which memory accesses can be reordered. Specifically, no load is allowed to be "hoisted" above a store with an as-yet-unknown address, because when that store's address does become available the processor might find that the load and store are accessing the same address (i.e., that they are aliased).

Because most load-store pairs don't alias, processors that play it safe, like the P6, lose quite a bit of performance to false aliasing, where the processor assumes that two or more memory accesses alias when in reality they do not. Let's take a look at exactly where this performance loss comes from.

The figure below shows a cycle-by-cycle breakdown of how options A and B execute on a processor that uses conservative memory access reordering assumptions, like the P6 and the Pentium 4.


Execution without memory disambiguation

In both options, the store's destination address must be known before either of the memory accesses can be carried out. That address does not become available until the second cycle, which means the processor cannot execute either the store or the load before then.

When the address becomes available at cycle two, if option A is in effect then the processor must wait another cycle for the store to update the red memory cell before executing the load. Then, the load executes, and it too takes an extra cycle to move the data from the red memory cell into the register. Finally, on the sixth cycle the add is executed.

If the processor discovers that option B is in effect and the accesses are not aliased, the load can execute immediately after (or even in parallel with) the store.

Intel's memory disambiguation technology attempts to identify instances of false aliasing, so that in instances where the memory accesses are not aliased a load can actually execute before a store's destination address becomes available. The figure below illustrates option B with and without memory disambiguation.


Execution with and without memory disambiguation

When option B is executed with memory disambiguation, the load can go ahead and execute while the store's address is still unknown. The store, for its part, can simply execute whenever its address becomes available.

Reordering the memory accesses in this manner enables the processor to execute the addition a full two cycles earlier than it would have without memory disambiguation. If you consider a large instruction window that contains many memory accesses, the ability to speculatively hoist loads above stores could save a significant number of total execution cycles.
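The cycle counts from the two figures can be condensed into a toy timing model. This is only a sketch: the cycle numbers follow the walkthrough above (store address resolving in cycle 2, one-cycle memory latencies), not any real pipeline.

```python
# Hypothetical timing sketch of the three cases discussed above.
def add_issue_cycle(aliased, disambiguation):
    addr_ready = 2                    # store's address resolves in cycle 2
    if disambiguation and not aliased:
        load_issue = 1                # load hoisted above the unresolved store
    elif aliased:
        load_issue = addr_ready + 2   # store executes, then updates memory
    else:
        load_issue = addr_ready + 1   # load issues alongside the store
    return load_issue + 2             # one cycle of load latency, then the add

assert add_issue_cycle(aliased=True,  disambiguation=False) == 6
assert add_issue_cycle(aliased=False, disambiguation=False) == 5
assert add_issue_cycle(aliased=False, disambiguation=True)  == 3  # two cycles saved
```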

Intel has developed an algorithm that examines memory accesses in order to guess which ones are probably aliased and which ones aren't. If the algorithm determines that a load-store pair is aliased, then it forces the pair to execute in program order. If the algorithm decides that the pair is not aliased, then the load may execute before the store.
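Intel hasn't published the details of this algorithm, so the following is a purely hypothetical sketch in the spirit of published hazard predictors: a small table of saturating counters, indexed by the load's instruction address, that learns which loads have aliased with stores in the past.

```python
# Hypothetical alias predictor sketch -- NOT Intel's actual algorithm.
class AliasPredictor:
    def __init__(self, entries=256):
        self.table = [0] * entries       # 2-bit counters; >= 2 means "likely aliased"

    def _index(self, load_pc):
        return load_pc % len(self.table)

    def predict_aliased(self, load_pc):
        return self.table[self._index(load_pc)] >= 2

    def update(self, load_pc, did_alias):
        i = self._index(load_pc)
        if did_alias:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

p = AliasPredictor()
p.update(0x4004, did_alias=True)
p.update(0x4004, did_alias=True)
assert p.predict_aliased(0x4004)       # learned to alias: execute in order
assert not p.predict_aliased(0x4008)   # default guess: hoist the load
```

A predictor like this errs toward hoisting, which matches the observation that the large majority of load-store pairs don't alias.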

In cases where Core's memory disambiguation algorithm guesses wrongly, the pipeline stalls and any operations that were dependent on the erroneous load are flushed and restarted once the correct data has been (re)loaded from memory.
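The safety net behind that recovery can be sketched as an address check at store resolution: when an older store's address finally becomes known, it is compared against every younger load that was speculatively hoisted past it, and any match means the load read stale data. The function and its arguments here are invented for illustration.

```python
# Hypothetical misprediction check -- a sketch, not Core's actual logic.
def resolve_store(store_addr, store_seq, speculative_loads):
    """speculative_loads: list of (seq, addr) for loads hoisted past this store.

    Returns the sequence numbers of loads that actually aliased with the
    store; those loads (and their dependents) must be flushed and replayed.
    """
    return [seq for seq, addr in speculative_loads
            if seq > store_seq and addr == store_addr]

# Load 12 read the cell this store writes: it must be replayed.
assert resolve_store(0x100, 10, [(12, 0x100), (13, 0x200)]) == [12]
```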

By cutting down drastically on false aliasing, Core eliminates many cycles that are unnecessarily wasted on waiting for store address data to become available. It's too early to say how much of an impact memory disambiguation will have on performance, but it is likely to be significant, especially for memory-intensive floating-point code.

Conclusions

Core looks like it has what it takes to carry Intel forward for at least another five years. By focusing on single-threaded performance, Core will excel on the types of applications that will make up the vast majority of server and consumer code in the near to medium term. And because it's designed for relatively low core-count multicore, it will help the software industry gradually make the transition to multithreaded code.

Core is wide enough that I can see hyperthreading returning to Intel's desktop and server processors fairly quickly. There's no question that hyperthreading is a good way to counter the wasteful effects of memory latency, and its addition to Core will yield even more performance per watt.

Bibliography and suggested reading
