ISA Change

Assume that loads make up 20% of the instructions in your workload and stores 10%. You are considering eliminating the immediate offset from the load and store instructions of a MIPS-like instruction set. That would replace

LW	R1, offset(R2)

with

ADDI	R2, R2, offset
LW		R1, (R2)

But 40% of the loads and stores use an offset of 0, which allows replacing

LW		R1, 0(R2)

with just

LW		R1, (R2)
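
The effect of this change on instruction count can be sketched with a few lines of arithmetic. This is a minimal sketch that assumes every load or store with a nonzero offset gains exactly one ADDI and that nothing else in the program changes; the function and variable names are hypothetical.

```python
# Workload figures taken from the problem statement.
LOAD_FRAC = 0.20         # loads as a fraction of all instructions
STORE_FRAC = 0.10        # stores as a fraction of all instructions
ZERO_OFFSET_FRAC = 0.40  # loads/stores whose offset is already 0


def extra_instruction_fraction(load_frac, store_frac, zero_offset_frac):
    """Extra ADDI instructions per original instruction.

    Assumption: each load/store with a nonzero offset needs exactly
    one additional ADDI, and no other instruction is affected.
    """
    mem_frac = load_frac + store_frac
    return mem_frac * (1.0 - zero_offset_frac)


extra = extra_instruction_fraction(LOAD_FRAC, STORE_FRAC, ZERO_OFFSET_FRAC)
relative_count = 1.0 + extra

print(f"extra ADDI per original instruction: {extra:.2f}")
print(f"instruction count relative to the original ISA: {relative_count:.2f}")
```

Under these assumptions the new ISA executes 18% more instructions than the original for the same workload.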

Organization

This change will allow you to combine the EX and DM stages of the 5-stage MIPS pipeline. Draw the organization for the new 4-stage pipeline.

Forwarding

Complete a pipeline execution timeline for the original 5-stage MIPS pipeline with forwarding for the following sequence of instructions. If forwarding occurs, indicate the stages involved.

LD		R1, 0(R2)
ADD	R3, R1, R1

Complete a pipeline execution timeline for the new 4-stage pipeline with forwarding for the following sequence of instructions. If forwarding occurs, indicate the stages involved.

LD		R1, (R2)
ADD	R3, R1, R1
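
One way to check a hand-drawn timeline is to generate it. The sketch below assumes the textbook behavior of the 5-stage pipeline (a one-cycle load-use stall, with the loaded value forwarded from the DM/WB latch to the ALU input) and assumes the merged stage, here given the hypothetical name EM, can forward its result to the next instruction's EM stage with no stall. The helper names are illustrative only.

```python
def row(stages, issue_cycle, stall_at=None, stall_cycles=0):
    """Build one timeline row as a list indexed by cycle (0-based).

    stall_at is the index of the stage that must wait; "--" marks
    each bubble cycle inserted before that stage.
    """
    cells = [""] * issue_cycle
    for i, stage in enumerate(stages):
        if i == stall_at:
            cells.extend(["--"] * stall_cycles)
        cells.append(stage)
    return cells


def show(title, rows):
    """Format a list of (instruction, cells) pairs as a text table."""
    width = max(len(cells) for _, cells in rows)
    lines = [title]
    lines.append("cycle".ljust(18) + " ".join(f"{c + 1:>3}" for c in range(width)))
    for name, cells in rows:
        padded = cells + [""] * (width - len(cells))
        lines.append(name.ljust(18) + " ".join(f"{c:>3}" for c in padded))
    return "\n".join(lines)


FIVE_STAGE = ["IF", "ID", "EX", "DM", "WB"]
FOUR_STAGE = ["IF", "ID", "EM", "WB"]  # EM = merged EX/DM (assumed name)

# 5-stage: ADD waits one cycle, then receives R1 via DM/WB -> EX forwarding.
five_rows = [
    ("LD  R1, 0(R2)", row(FIVE_STAGE, 0)),
    ("ADD R3, R1, R1", row(FIVE_STAGE, 1, stall_at=2, stall_cycles=1)),
]
# 4-stage: assumed EM/WB -> EM forwarding, so no stall is needed.
four_rows = [
    ("LD  R1, (R2)", row(FOUR_STAGE, 0)),
    ("ADD R3, R1, R1", row(FOUR_STAGE, 1)),
]

print(show("Original 5-stage pipeline", five_rows))
print()
print(show("New 4-stage pipeline", four_rows))
```

In the generated tables, the ADD reaches its execute stage one cycle after the load's data becomes available in both designs; only the 5-stage version needs a bubble to make that true.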

Speedup

According to data collected with your application using the original ISA, 10% of loads have a load-to-ALU data hazard that the compiler does not remove. Assuming no change in clock speed, use this data, together with the other data given above, to find the overall expected speedup of the 4-stage design over the original 5-stage MIPS architecture.
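
The speedup can be set up as a ratio of execution times, instruction count times CPI, since the clock is unchanged. The sketch below is one possible reading of the problem, not a definitive solution. It assumes a base CPI of 1, a one-cycle load-use stall in the 5-stage pipeline, no load-use stall in the 4-stage pipeline, and no new stalls caused by the added ADDI instructions. All names are hypothetical.

```python
def speedup_4stage_over_5stage(load_frac=0.20, store_frac=0.10,
                               zero_offset_frac=0.40, load_use_frac=0.10,
                               old_stall_cycles=1, new_stall_cycles=0):
    """Speedup = (IC_old * CPI_old) / (IC_new * CPI_new), same clock.

    Assumptions: base CPI is 1, the only stalls are load-to-ALU
    hazards, and each nonzero-offset load/store adds one ADDI.
    """
    # Instruction count of the new design relative to the old one.
    ic_new = 1.0 + (load_frac + store_frac) * (1.0 - zero_offset_frac)

    # Stall cycles per original instruction in each design.
    stalls_old = load_frac * load_use_frac * old_stall_cycles
    stalls_new = load_frac * load_use_frac * new_stall_cycles

    time_old = 1.0 + stalls_old     # cycles per original instruction
    time_new = ic_new + stalls_new  # same unit, including extra ADDIs
    return time_old / time_new


s = speedup_4stage_over_5stage()
print(f"speedup of 4-stage over 5-stage: {s:.3f}")
```

With these assumptions the ratio is 1.02 / 1.18, or roughly 0.86, which is a slowdown: the extra ADDI instructions cost more than the removed load-use stalls save. A different assumption about stalls in the 4-stage design can be explored through the `new_stall_cycles` parameter.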

MIPS R4000 Pipeline

The MIPS R4000 processor has an eight-stage pipeline, with the stages shown below.

R4000 pipeline stages: IF IS RF EX DF DS TC WB

Speedup

What is the ideal pipeline speedup for this processor?

Forwarding

For data forwarding, how many additional inputs are needed for the multiplexers at the inputs to the ALU? Where does each come from? What kind of hazard(s) do these address?

Branching options

The branch delay is 3 cycles. If branches make up 20% of the total instruction mix and 14% of the branches are taken, evaluate the effective CPI and the adjusted pipeline speedup for each of these options:

  a. Stall for three cycles.
  b. Expose one delay slot and stall for two cycles, where the delay slot can be filled 60% of the time and the resulting computation is useful 80% of the time.
  c. Expose two delay slots and stall for one cycle, where the first slot's statistics are the same as in (b) and the second slot is filled 10% of the time, with useful computation 40% of the time.
  d. Expose one delay slot and predict not taken for the other two cycles (what was really done).
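
The four options, in the order listed, can be compared with a short calculation. This is a sketch under stated assumptions rather than the official answer: base CPI is 1, the adjusted speedup is the pipeline depth divided by the effective CPI, an exposed delay slot costs one cycle whenever it is not filled with useful work, and a mispredicted (taken) branch under predict-not-taken loses both remaining cycles. The names are hypothetical.

```python
PIPELINE_DEPTH = 8  # R4000 stage count; also the ideal speedup
BRANCH_FRAC = 0.20  # branches as a fraction of all instructions
TAKEN_FRAC = 0.14   # fraction of branches that are taken


def wasted_slot(fill_frac, useful_frac):
    """Expected cycles lost in one exposed delay slot.

    Assumption: the slot is productive only when it is both filled
    and the filled instruction does useful work.
    """
    return 1.0 - fill_frac * useful_frac


slot1 = wasted_slot(0.60, 0.80)  # first delay slot
slot2 = wasted_slot(0.10, 0.40)  # second delay slot

# Expected penalty cycles per branch for each option, in list order.
penalties = {
    "stall 3 cycles": 3.0,
    "1 delay slot, stall 2": slot1 + 2.0,
    "2 delay slots, stall 1": slot1 + slot2 + 1.0,
    "1 delay slot, predict not taken": slot1 + TAKEN_FRAC * 2.0,
}


def effective_cpi(penalty_per_branch):
    return 1.0 + BRANCH_FRAC * penalty_per_branch


results = {}
for name, penalty in penalties.items():
    cpi = effective_cpi(penalty)
    results[name] = (cpi, PIPELINE_DEPTH / cpi)
    print(f"{name:<32} CPI = {cpi:.3f}  speedup = {PIPELINE_DEPTH / cpi:.2f}")
```

Under this model the fourth option has the lowest CPI, because the two non-slot cycles are lost only on the 14% of branches that are taken.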

Branches in practice

Given the strategy in (d), which of the following instructions are executed when the branch is taken? Which are executed when the branch is not taken?

		BNEZ	R5, Target
		SUB		R1, R2, R3
		ADD	R1, R1, R4
Target: 	SUB		R3, R3, R1
		ADD	R2, R1, R4
		ADD	R2, R2, R3

Note that the first SUB instruction is already in the branch delay slot; you do not need to fill the delay slot yourself.