Into the Core: Intel's next-generation microarchitecture

By Jon "Hannibal" Stokes

Instruction fusion

Macro-fusion

Another new feature of Core's front-end hardware is its ability to fuse certain pairs of x86 instructions together in the predecode phase and send them through a single decoder to be translated into a single micro-op. This feature, called macro-fusion, applies only to specific instruction combinations: a compare or test instruction can be macro-fused with the branch instruction that follows it. Any one of Core's four decoders can produce a macro-fused micro-op on a given cycle, but no more than one such micro-op can be generated per cycle.
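The scheme described above can be sketched as a toy model. This is not Intel's actual predecode logic, which operates on raw instruction bytes; the mnemonic lists and function names here are invented for illustration, and only a few representative compare/test and branch mnemonics are included.

```python
# Toy model of macro-fusion in the predecode phase. The instruction
# mnemonics and fusion rules are illustrative, not exhaustive.

FUSIBLE_FIRST = {"cmp", "test"}              # compare/test instructions
FUSIBLE_SECOND = {"je", "jne", "jl", "jg"}   # conditional branches

def predecode(instructions):
    """Scan one decode group, fusing at most one compare/test + branch
    pair per cycle into a single macro-fused micro-op."""
    uops = []
    fused_this_cycle = False   # hardware limit: one macro-fusion per cycle
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (not fused_this_cycle and op in FUSIBLE_FIRST
                and nxt in FUSIBLE_SECOND):
            # Two x86 instructions become one macro-fused micro-op.
            uops.append(f"{op}+{nxt}")
            fused_this_cycle = True
            i += 2
        else:
            uops.append(op)
            i += 1
    return uops
```

Running `predecode(["cmp", "jne", "add", "cmp", "je"])` yields `["cmp+jne", "add", "cmp", "je"]`: the first compare/branch pair is fused, while the second is passed through unfused because the one-fusion-per-cycle budget is already spent.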

In addition to the new hardware that it requires in the predecode and decode phases of the pipeline, macro-fusion also necessitates some modifications to the ALU and branch execution units in the back end. These new hardware requirements are offset by the savings in bookkeeping hardware that macro-fusion yields, since there are fewer micro-ops in flight for the core to track. Ultimately, less bookkeeping hardware means better power efficiency per x86 instruction for the processor as a whole, which is why it's important for Core to approach the goal of one micro-op per x86 instruction as closely as possible.

Besides allowing Core to do more work with fewer ROB and RS entries, macro-fusion also increases the front end's decode bandwidth. Core's decode hardware can empty the instruction queue (IQ) that much more quickly when a single simple/fast decoder can take in two x86 instructions per cycle instead of one.

Finally, macro-fusion effectively increases Core's execution width, because a single ALU can execute what is essentially two x86 instructions simultaneously. This frees up execution slots for non-macro-fused instructions, and makes the processor appear wider than it actually is.

Micro-ops fusion

Micro-ops fusion, a technique that Intel first introduced with the Pentium M, has some of the same effects as macro-fusion, but it functions differently. Basically, a simple/fast decoder takes in a single x86 instruction that would normally translate into two micro-ops, and it produces a fused pair of micro-ops that are tracked by the ROB using a single entry.

When they reach the reservation station, the two members of this fused pair are allowed to issue separately, either in parallel through two different issue ports or serially through the same port, depending on the situation.

The most common types of fused micro-ops are loads and stores. Here's how I described the fused store in my original Pentium M coverage:

Store instructions on the P6 are broken down into two uops: a store-address uop and a store-data uop. The store-address uop is the command that calculates the address where the data is to be stored, and it's sent to the address generation unit in the P6's store-address unit for execution. The store-data uop is the command that writes the data to be stored into the outgoing store data buffer, from which the data will be written out to memory when the store instruction retires; this command is executed by the P6's store-data unit. Because the two operations are inherently parallel and are performed by two separate execution units on two separate issue ports, these two uops can be executed in parallel--the data can be written to the store buffer at the same time that the store address is being calculated.

According to Intel, the PM's instruction decoder not only decodes the store operation into two separate uops but it also fuses them together. I suspect that there has been an extra stage added to the decode pipe to handle this fusion. The instructions remain fused until they're issued (or "dispatched," in Intel's language) through the issue port to the actual store unit, at which point they're treated separately by the execution core. When both uops are completed they're treated as fused by the core's retirement unit.

Fused loads work similarly, although they issue serially instead of in parallel.
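The lifecycle described above can be sketched in a small model: the decoder emits a store as a fused store-address/store-data pair that occupies one ROB entry, the pair splits into two separately issued operations at the reservation station, and retirement treats it as a single fused micro-op again. The class, field, and port names below are invented for illustration; a real ROB entry tracks far more state.

```python
# Toy model of micro-op fusion for a store instruction. Names and port
# assignments are illustrative, not a description of the real hardware.

class ROBEntry:
    """One reorder-buffer entry holding a fused micro-op pair."""

    def __init__(self, uops):
        self.uops = uops          # the fused pair shares this single entry
        self.completed = set()

    def issue(self):
        # The pair splits at issue: store-address and store-data go
        # through different issue ports and can execute in parallel.
        return list(self.uops)

    def complete(self, uop_name):
        self.completed.add(uop_name)

    def can_retire(self):
        # Retirement treats the pair as one fused micro-op: the entry
        # retires only when both halves have completed.
        return self.completed == {name for name, _ in self.uops}

def decode_store():
    """Decode one x86 store into a fused store-address/store-data pair
    tracked by a single ROB entry."""
    return ROBEntry([("store-address", "port A"), ("store-data", "port B")])
```

A single call to `decode_store()` produces one ROB entry whose `issue()` returns two operations, mirroring the way fusion saves a bookkeeping entry without reducing the work the execution core performs.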

Like macro-fusion, micro-ops fusion enables the ROB to issue and commit more micro-ops using fewer entries and less hardware. It also effectively increases Core's decode, allocation, issue, and commit bandwidth. This makes Core more power efficient, because it does more with less hardware.
