
### Lecture 19, Pipelining Data Forwarding

```
Data forwarding example   CMSC 411 architecture

Consider the five-stage pipeline architecture:

IF instruction fetch, PC is address into memory fetching instruction
ID instruction decode and register read out of two values
EX execute instruction or compute data memory address
M  (MEM) data memory access, load or store at the computed address
WB write back value into general register

   IF       ID          EX        M       WB
   +--+     +--+        +--+     +--+     +--+
   |  |     |  |        | A|-|\  |  |     |  |
   |  |     |  |    /---|  | \ \_|  |     |  |
   |PC|-(I)-|IR|-(R)  = |  | / / |  |-(D)-|  |--+
   |  |     |  |  ^ \---| B|-|/  |  |     |  |  |
   +--+     +--+  |     +--+     +--+     +--+  |
    ^        ^    |      ^   ALU  ^        ^    |
    |        |    |      |        |        |    |
clk-+--------+-----------+--------+--------+    |
                  |                             |
                  +-----------------------------+

Now consider the instruction sequence:

400  lw  $1,100($0)  load general register 1 from memory location 100
404  lw  $2,104($0)  load general register 2 from memory location 104
408  nop
40C  nop             wait for register $2 to get data
410  add $3,$1,$2    add contents of registers 1 and 2, sum into register 3
414  nop
418  nop             wait for register $3 to get data
41C  add $4,$3,$1    add contents of registers 3 and 1, sum into register 4
420  nop
424  nop             wait for register $4 to get data
428  beq $3,$4,-100  branch to 314 if contents of registers 3 and 4 are equal
42C  add $4,$4,$4    add ..., this is the "delayed branch slot", always executed

The pipeline stage table with NO data forwarding is:

lw   IF ID EX M  WB
lw      IF ID EX M  WB
nop        IF ID EX M  WB
nop           IF ID EX M  WB
add              IF ID EX M  WB
nop                 IF ID EX M  WB
nop                    IF ID EX M  WB
add                       IF ID EX M  WB
nop                          IF ID EX M  WB
nop                             IF ID EX M  WB
beq                                IF ID EX M  WB
add                                   IF ID EX M  WB

time 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

This can be significantly improved with the addition of four
multiplexors and wiring.

   IF       ID                  EX          M       WB
   +--+     +--+           +--+          +--+       +--+
   |  |     |  |           | A|-(X)--|\  |  |       |  |
   |  |     |  |    /-(X)--|  | | |  \ \_|  |       |  |
   |PC|-(I)-|IR|-(R)   | = |  | | |  / / |  |-+-(D)-|  |--+
   |  |     |  |  ^ \-(X)--| B|-(X)--|/  |  | |     |  |  |
   +--+     +--+  |    |   +--+ | |      +--+ |     +--+  |
    ^        ^    |    |    ^   | |  ALU  ^   |      ^    |
    |        |    |    |    |   | |       |   |      |    |
clk-+--------+--------------+-------------+----------+    |
                  |    |        | |           |           |
                  |    +----------+-----------+           |
                  |             |                         |
                  +-------------+-------------------------+

The pipeline stage table with data forwarding is:

lw   IF ID EX M  WB
lw      IF ID EX M  WB
nop        IF ID EX M  WB                 saved one nop
add           IF ID EX M  WB              $2 in WB and used in EX
add              IF ID EX M  WB           saved two nops, $3 used
nop                 IF ID EX M  WB        saved one nop
beq                    IF ID EX M  WB     $4 in MEM and used in ID
add                       IF ID EX M  WB

time 1  2  3  4  5  6  7  8  9  10 11 12

Note the required nop from using data immediately after a load.
Note the required nop for the beq in the ID stage using an ALU result.

The data forwarding paths are shown in green with the additional
multiplexors. The control is explained below.

Green must be added to part2a.vhdl.
Blue already exists, used for discussion, do not change.

To understand the logic better, note that MEM_RD contains the register
destination of the output of the ALU and MEM_addr contains the value
of the output of the ALU for the instruction now in the MEM stage.

If the instruction in the EX stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the A side of the ALU.
(This is the A forward MEM_addr control signal.)

EX stage          MEM stage
|               |
+---------------+

If the instruction in the EX stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the B side of the ALU.
(This is the B forward MEM_addr control signal.)

EX stage          MEM stage
|            |
+------------+

To understand the logic better, note that WB_RD contains the register
destination of the output of the ALU or Memory and WB_result contains
the value of the output of the ALU or Memory for the instruction now
in the WB stage.

If the instruction in the EX stage has the WB_RD destination in
bits 25 downto 21, then WB_result must be routed to the A side of the ALU.
(This is the A forward WB_result control signal.)

If the instruction in the EX stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be routed to the B side of the ALU.
(This is the B forward WB_result control signal.)

Note that a beq instruction in the ID stage that needs a value from
the instruction in the WB stage does not need data forwarding.

If a beq instruction in the ID stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the top side of
the equal comparator.
(This is the 1 forward control signal.)

If a beq instruction in the ID stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the bottom side of
the equal comparator.
(This is the 2 forward control signal.)

ID stage        EX stage        MEM stage
|                            |
+----------------------------+
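
-- A minimal VHDL sketch of the two beq rules above, not the course's
-- reference solution. Assumptions: fwd1, fwd2, cmp_top and cmp_bot are
-- hypothetical signal names for the "1 forward" and "2 forward"
-- controls and the two comparator inputs; lw_op is a hypothetical
-- word_6 constant holding the lw op code; MEM_RD is a word_5 signal.
-- ID_reg1 is ID_IR(25 downto 21) and ID_reg2 is ID_IR(20 downto 16).
-- Register zero and a lw in the MEM stage are excluded, since a lw
-- has an address, not its loaded data, in MEM_addr.

fwd1 <= '1' when (ID_reg1 = MEM_RD) and (MEM_RD /= "00000")
                 and (MEM_OP /= lw_op) else '0';
fwd2 <= '1' when (ID_reg2 = MEM_RD) and (MEM_RD /= "00000")
                 and (MEM_OP /= lw_op) else '0';

cmp_top <= MEM_addr when fwd1 = '1' else ID_read_data_1;  -- top side
cmp_bot <= MEM_addr when fwd2 = '1' else ID_read_data_2;  -- bottom side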

If a beq instruction in the ID stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be used by the bottom side of
the equal comparator.
(This happens by magic. Not really, two rules above apply.)

ID stage        EX stage    MEM stage    WB stage
beq $3,$4,-100      nop         nop       lw $4,8($3)
       |                                     |
       +-------------------------------------+

The data forwarding rules can be summarized based on the
cs411 schematic, shown above.

ID stage beq data forwarding:

default with no data forwarding is ID_read_data_1
1 forward MEM_addr is  ID_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

default with no data forwarding is ID_read_data_2
2 forward MEM_addr is  ID_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

EX stage data forwarding:

default with no data forwarding is EX_A
A forward MEM_addr is  EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
A forward WB_result is  EX_reg1=WB_RD and WB_RD/=0

default with no data forwarding is EX_B
B forward MEM_addr is  EX_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
B forward WB_result is  EX_reg2=WB_RD and WB_RD/=0

Note: the entity mux32_3 is designed to handle the above.

ID_RD is 0 for ID_OP= beq, j, sw (nop, all zeros, automatic zero in RD)
thus EX_RD, MEM_RD,  WB_RD = 0 for these instructions
Because register zero is always zero, we can use 0 for
a destination for every instruction that does not
produce a result in a register. Thus no data forwarding
will occur for instructions that do not produce a value
in a register.

note: ID_reg1 is ID_IR(25 downto 21)
ID_reg2 is ID_IR(20 downto 16)
EX_reg1 is EX_IR(25 downto 21)
EX_reg2 is EX_IR(20 downto 16)
MEM_OP  is MEM_IR(31 downto 26)
EX_OP   is EX_IR(31 downto 26)
ID_OP   is ID_IR(31 downto 26)

These shorter names can be used with VHDL alias statements:

alias  ID_reg1 : word_5 is ID_IR(25 downto 21);
alias  ID_reg2 : word_5 is ID_IR(20 downto 16);
alias  EX_reg1 : word_5 is EX_IR(25 downto 21);
alias  EX_reg2 : word_5 is EX_IR(20 downto 16);
alias  MEM_OP  : word_6 is MEM_IR(31 downto 26);
alias  EX_OP   : word_6 is EX_IR(31 downto 26);
alias  ID_OP   : word_6 is ID_IR(31 downto 26);

Why is the priority mux, mux32_3, needed?
Answer: Consider MEM_RD with a destination value 3 and
WB_RD with a destination value 3.

For this to happen, some program or some person would have
written code such as:

sub  $3,$12,$11
add  $3,$1,$2   also writes $3, so MEM_RD = WB_RD = 3 when the next add is in EX
add  $4,$3,$3   double the value of $3

Well, rather obviously, the result of the  sub  is never used and
thus the answer to our question is that MEM_addr must be used. This
is the closest prior instruction with the required result. The
correct design is implemented using the priority mux32_3 with the
MEM_addr in the  in1  priority input.
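
-- The priority can be sketched behaviorally as follows. This is an
-- illustration only, not the mux32_3 entity itself: ALU_A and AFWR are
-- hypothetical names for the ALU A input and the "A forward WB_result"
-- control; AFMA is the "A forward MEM_addr" control.

ALU_A <= MEM_addr  when AFMA = '1' else   -- closest prior result wins
         WB_result when AFWR = '1' else   -- older result, only if no MEM match
         EX_A;                            -- default, no data forwarding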

The control signal  A forward MEM_addr  may be implemented in VHDL as:
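
-- A minimal sketch only, not the course's reference solution. It
-- assumes AFMA is declared as a std_logic signal, MEM_RD is a word_5
-- signal, and lw_op is a hypothetical word_6 constant set to the lw
-- op code listed in cs411_opcodes.txt.

AFMA <= '1' when (EX_reg1 = MEM_RD)      -- EX source 1 is the MEM destination
             and (MEM_RD /= "00000")     -- register zero is never forwarded
             and (MEM_OP /= lw_op)       -- for lw, MEM_addr is an address, not data
        else '0';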

Here is where you may want to add a debug process. Replace AFMA
with any signal name of interest:

-- requires  use STD.textio.all;  and, for std_logic values,
--           use IEEE.std_logic_textio.all;
prtAFMA: process (AFMA)
  variable my_line : LINE;        -- my_line needs to be defined
begin
  write(my_line, string'("AFMA="));
  write(my_line, AFMA);           -- or hwrite for long signals
  write(my_line, string'(" at="));
  write(my_line, now);            -- "now" is simulation time
  writeline(output, my_line);     -- outputs line
end process prtAFMA;

See part2a.chk for the _RD signals and values.

See cs411_opcodes.txt for the op code values.

Now, to finish part2a.vhdl, the jump and branch instructions must be
implemented. This is shown in green on the upper part of the schematic.

The signal out of the jump address box would be coded in VHDL as:

jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";

```