Below is an architecture diagram for the ARM Cortex-A57 processor (from PC Watch by way of AnandTech). This processor can issue up to three ARM instructions per cycle (3-way Instruction Decode), and can issue up to 8 micro-operations per cycle to the execution units using a variation of Tomasulo's algorithm.

ARM Cortex-A57 architecture diagram

Pipeline

Pipeline depth

What is the ideal pipeline speedup over a single-cycle design for simple integer ALU instructions, treating the architecture as just a simple pipeline (only one instruction issued per cycle, and no out-of-order completion)?

Multiple issue

What is the speedup over a single-cycle design for simple integer ALU instructions if you assume three instructions issued per cycle and no stalls or misprediction penalties?
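
As a reminder of the idealized model behind both pipeline questions: with perfectly balanced stages and no stalls, a k-stage pipeline runs its clock k times faster than a single-cycle design and still completes one instruction per cycle, and issuing w instructions per cycle multiplies that throughput by w. A minimal sketch follows. The function name and the example depth are illustrative assumptions; the real depth has to be read off the diagram.

```python
def ideal_speedup(depth: int, issue_width: int = 1) -> float:
    """Ideal speedup over a single-cycle design.

    Assumes perfectly balanced stages, no stalls, and no latch overhead,
    so the clock is `depth` times faster and `issue_width` instructions
    complete every cycle.
    """
    return float(depth * issue_width)


# Hypothetical 5-stage pipeline, NOT the A57's actual depth.
example_depth = 5
single_issue = ideal_speedup(example_depth)
triple_issue = ideal_speedup(example_depth, issue_width=3)
print(single_issue, triple_issue)
```

Substituting the integer pipeline depth counted from the diagram for `example_depth` gives the figures the two questions ask for under this model.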

Branching

Branch Penalty

What is the branch penalty in cycles?

Branch Prediction

Of the things in the Branch Prediction box, which are used to predict branch direction and which are used to predict branch target address?

CPI with branching

Assuming sustained issue of three instructions per cycle, and no stalls due to data hazards, what is the expected CPI if 20% of the instructions are branches and branch prediction achieves 95% accuracy?
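
One common way to set this up is to start from the base CPI of 1/3 and add the misprediction penalty weighted by how often a mispredicted branch occurs. The sketch below assumes that only mispredicted branches cost extra cycles and that correctly predicted branches are free. The function name is illustrative, and the penalty value is a placeholder to be replaced by the answer to the branch penalty question.

```python
def cpi_with_branches(issue_width: int, branch_fraction: float,
                      accuracy: float, penalty_cycles: float) -> float:
    """Expected CPI when only mispredicted branches add stall cycles."""
    base_cpi = 1.0 / issue_width
    # Fraction of all instructions that are mispredicted branches.
    mispredict_rate = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredict_rate * penalty_cycles


# Hypothetical penalty of 10 cycles, NOT taken from the diagram.
example_penalty = 10
cpi = cpi_with_branches(issue_width=3, branch_fraction=0.20,
                        accuracy=0.95, penalty_cycles=example_penalty)
print(round(cpi, 4))
```

With 20% branches and 95% accuracy, 1% of all instructions are mispredicted branches, so each cycle of penalty adds 0.01 to the CPI.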

Cache

L1 Cache addressing

What is the breakdown of a 48-bit virtual address into tag, index, and offset for the L1 instruction cache? For the L1 data cache?

L2 Cache addressing

What is the breakdown of a 44-bit physical address into tag, index, and offset for a 2 MB L2 cache with 64-byte cache lines?
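
The breakdown depends on the cache's associativity, which the question does not state and which should be checked against the diagram. The sketch below shows the general calculation for power-of-two set counts and line sizes. The function name is illustrative, and the 16-way value in the example is an assumption, not something given in the question.

```python
def address_breakdown(address_bits: int, cache_bytes: int,
                      line_bytes: int, ways: int) -> tuple[int, int, int]:
    """Return (tag_bits, index_bits, offset_bits).

    Assumes the line size and the number of sets are powers of two.
    """
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


l2_size = 2 * 1024 * 1024  # 2 MB, from the question

# ASSUMPTION: 16-way set associative; verify against the diagram.
assumed_16_way = address_breakdown(44, l2_size, 64, ways=16)

# For comparison: the same cache if it were direct mapped.
direct_mapped = address_breakdown(44, l2_size, 64, ways=1)
print(assumed_16_way, direct_mapped)
```

The offset is always 6 bits for 64-byte lines; higher associativity means fewer sets, which moves bits from the index into the tag.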

Memory access timing

Assume an A57 running at 2 GHz; L1 access time of 2 ns with a 90% hit rate; L2 hit time of 9 ns with a 95% hit rate; and memory with an access time of 154 ns. What is the average memory access time?
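
A sketch of the usual average memory access time (AMAT) recurrence follows, under two assumptions that the question leaves open: the 95% L2 hit rate is local (measured over L1 misses only), and each level's time is added on top of the time already spent at the level above. The function name is illustrative.

```python
def amat(l1_time: float, l1_hit: float,
         l2_time: float, l2_hit: float, mem_time: float) -> float:
    """AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory time)."""
    l1_miss_cost = l2_time + (1.0 - l2_hit) * mem_time
    return l1_time + (1.0 - l1_hit) * l1_miss_cost


# Values from the question, all in nanoseconds.
amat_ns = amat(l1_time=2.0, l1_hit=0.90,
               l2_time=9.0, l2_hit=0.95, mem_time=154.0)

# At 2 GHz one cycle is 0.5 ns, so multiply by 2 to convert to cycles.
amat_cycles = amat_ns * 2.0
print(round(amat_ns, 2), round(amat_cycles, 2))
```

Under these assumptions the result is about 3.67 ns, or about 7.34 cycles at 2 GHz. Treating the L2 hit rate as global instead would change the miss term.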

CPI with branching and memory

What is the total expected CPI, including memory access stalls and branch penalties (from 2C, CPI with branching), for a program with 15% loads and 2% stores?
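
One possible way to combine the pieces is to add memory stall cycles per instruction to the CPI that already includes branch penalties. The sketch below assumes that instruction fetches never stall, that loads and stores stall alike, and that each data access costs a fixed average number of stall cycles beyond the L1 hit time. The function name and both example inputs are placeholders, not answers to the earlier questions.

```python
def total_cpi(branch_cpi: float, load_fraction: float,
              store_fraction: float, stall_cycles_per_access: float) -> float:
    """CPI with branch penalties plus average data-memory stall cycles."""
    mem_fraction = load_fraction + store_fraction
    return branch_cpi + mem_fraction * stall_cycles_per_access


# Hypothetical inputs: substitute the CPI from the branching question and
# the average stall per access derived from the memory timing question.
example_branch_cpi = 0.5
example_stall_cycles = 3.0
cpi_total = total_cpi(example_branch_cpi, load_fraction=0.15,
                      store_fraction=0.02,
                      stall_cycles_per_access=example_stall_cycles)
print(round(cpi_total, 3))
```

Whether the stall per access should be the full average memory access time or only the portion beyond an L1 hit depends on whether L1 hits are assumed to be hidden by the pipeline, so state that assumption in the answer.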