CPU Architecture, or why Data Oriented Programming

  1. Models
    1. CS is built on simplified models
      1. You don't need to understand differential equations of transistor behavior to program
      2. PRAM for MIMD parallel, SIMD for GPU
      3. Allow you to reason about performance
    2. Sometimes the model misses something important
      1. Code is not as fast as it could be
      2. Trying to make code faster makes it slower
    3. CPU model
      1. Executes instructions in order
      2. Operations are fixed cost
        • Or fall into a few simple cost classes (simple vs. transcendental)
      3. This is pretty wrong
  2. Bandwidth vs. Latency (network terms)
    1. Latency = how long until something completes (time to first byte)
    2. Bandwidth = how many per second (transmission rate)
    3. Memory
      1. Optimized for bandwidth
      2. Expensive to open a row, cheaper to fetch more from the same row
    4. CPUs
      1. Optimized for throughput (= "bandwidth"), instructions/second
      2. Many instructions in flight
        1. Pipelining
        2. Multiple instructions issued per cycle (7 µops on Core)
        3. Multiple instructions retired per cycle
        4. Haswell: Up to 192 instructions in flight
        5. Start 2-4 simple integer operations per cycle
      3. Latency for a CPU is the time between dependent instructions (sketch below)
        1. An instruction blocks if its inputs aren't ready, but the core keeps working on other instructions
      4. Also a (longer) time-in-flight measure: issue to retire
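
To make the dependent-instruction point concrete, here is a minimal sketch (not from the outline): both loops do the same additions, but the first is one long dependency chain running at add latency, while the second exposes independent adds the out-of-order core can overlap. The choice of four accumulators is an illustrative assumption, not a tuned value.

```cpp
#include <cstddef>

// One accumulator: every add depends on the previous add, so the loop
// runs at the latency of an add, not at the core's issue throughput.
float sum_dependent(const float* a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

// Four independent accumulators: several adds can be in flight at once,
// so the loop approaches the throughput limit instead of the latency
// limit. (Note: this reassociates the floating-point sum.)
float sum_independent(const float* a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```
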
  3. Memory
    1. Stats (from chadaustin.me/2015/04/thinking-about-performance)
      Cycle: 0.25-0.5 ns

      Level   Latency                    Equivalent op  Granularity            Size       Associativity
      Disk    10 ms (~20M cycles)        -              access by page         TBs        1
      SSD     50-250 µs (~100k cycles)   -              access by page         TBs        1
      Memory  ~200 cycles                atan           access by cache block  GBs        Full
      L1      3-4 cycles                 ALU op         access by word         10s of KB  1-8
      L2      10-15 cycles               DIV r8         access by cache block  MBs        2-8
      L3      ~40 cycles                 DIV r64        access by cache block  10s of MB  4-16
    2. What to do
      1. Disk & SSD: do something else
      2. Memory
        1. Dependent instructions wait
        2. It doesn't take long before every instruction in flight depends on the load
      3. Cache
        1. Likely to reuse data (e.g. variables), so keep some in small fast memory
        2. Set found by bottom bits of the address; associativity helps with conflicts (set-index sketch below)
      4. Speculation
        1. Cheap to get neighboring bytes, so do it in case you need them
        2. Recognize patterns (sequential fetch) & prefetch blocks
        3. If you're wrong, just replace them later
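
A minimal sketch of the "find by bottom bits" lookup mentioned above. The 32 KB / 8-way / 64-byte-line geometry is an assumption (typical of recent x86 L1 data caches), not something the outline specifies:

```cpp
#include <cstdint>
#include <cstdio>

// Assumed geometry: 32 KB, 8-way set-associative, 64-byte cache blocks.
constexpr uint64_t kLineSize = 64;
constexpr uint64_t kWays = 8;
constexpr uint64_t kSets = 32 * 1024 / (kLineSize * kWays);  // 64 sets

// The set is selected by the address bits just above the block offset;
// associativity means kWays blocks can coexist in one set before they
// start evicting each other.
uint64_t cache_set(uint64_t addr) {
    return (addr / kLineSize) % kSets;
}

int main() {
    // Addresses 4096 bytes apart all land in the same set, which is why
    // power-of-2 strides can behave badly (see "Secondary cache concerns").
    for (uint64_t i = 0; i < 4; i++) {
        uint64_t addr = i * 4096;
        printf("addr %6llu -> set %llu\n",
               (unsigned long long)addr,
               (unsigned long long)cache_set(addr));
    }
    return 0;
}
```
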
    3. Change in model
      1. Local, coherent memory accesses cheap
      2. Random memory access blocks all instructions in flight
    4. Change in programming
      1. Group related data (layout sketch below)
      2. Avoid unnecessary pointer dereferencing
        • In C++, -> gets called the "cache miss operator"
      3. Avoid virtual inheritance
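
A minimal sketch of the layout change (the particle names and fields are hypothetical, chosen for illustration): the first version chases a pointer per element, while the second keeps the hot field contiguous, so each fetched cache block is fully used and the prefetcher can keep up.

```cpp
#include <vector>

// Pointer-heavy layout: elements are separate heap allocations reached
// through pointers, so each step of the loop risks a cache miss.
struct NodeParticle {
    float x, y, z;
    NodeParticle* next;
};

float sum_x_list(const NodeParticle* head) {
    float sum = 0.0f;
    for (const NodeParticle* p = head; p; p = p->next)  // the "cache miss operator"
        sum += p->x;
    return sum;
}

// Grouped layout: the field we iterate over is one contiguous array,
// so a 64-byte cache block delivers 16 useful floats at a time.
struct Particles {
    std::vector<float> x, y, z;  // structure-of-arrays
};

float sum_x_soa(const Particles& ps) {
    float sum = 0.0f;
    for (float v : ps.x)  // sequential, prefetch-friendly
        sum += v;
    return sum;
}
```
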
    5. Example: linear vs. binary search (sketch below)
      1. Linear: uses every byte of each cache block, prefetch-friendly
      2. Binary: uses only part of each cache block, unlikely to prefetch, but asymptotically better
      3. Show data: crossover at about 32 elements
      4. Show also quadratic vs linear
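
A minimal sketch of the two searches being compared; both return the index of the first element >= key in a sorted array. The ~32-element crossover above is the outline's data point, nothing here is measured:

```cpp
#include <cstddef>

// Linear scan: walks memory in order, so every byte of each fetched
// cache block is used and the hardware prefetcher stays ahead of it;
// the loop branch is almost always predicted correctly.
size_t linear_lower_bound(const int* a, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (a[i] >= key) return i;
    return n;
}

// Binary search: O(log n) probes, but each probe lands far from the
// previous one (most of each fetched block goes unused, no useful
// prefetch) and the comparison branch is close to a coin flip.
size_t binary_lower_bound(const int* a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1;
        else              hi = mid;
    }
    return lo;
}
```
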
    6. Secondary cache concerns
      1. Set index is a simple hash of the address, and associativity is limited, so power-of-2 strides can be bad
      2. Communication with other cores is synchronized at cache-block granularity (false-sharing sketch below)
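
A minimal sketch of the cache-block-granularity point, commonly called false sharing. The 64-byte figure is a typical x86 assumption; the structs and iteration count are illustrative:

```cpp
#include <atomic>
#include <thread>

// Both counters share one 64-byte cache block: each increment on one
// core invalidates the block on the other core, even though the two
// threads never touch the same counter ("false sharing").
struct SharedBlock {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Giving each counter its own cache block removes the ping-pong.
struct PaddedBlock {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Time this with each struct; the padded version typically runs much
// faster despite using strictly more memory.
template <typename Counters>
void hammer(Counters& c) {
    std::thread t1([&c] { for (int i = 0; i < 10'000'000; i++) c.a++; });
    std::thread t2([&c] { for (int i = 0; i < 10'000'000; i++) c.b++; });
    t1.join();
    t2.join();
}
```
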
  4. Branching
    1. Branch stops issue until resolved
      1. Figure out branch target
      2. Is branch taken?
    2. Speculation / prediction
      1. Make a guess
      2. Issue those instructions
      3. Don't retire until you know if you were right
      4. Throw out if wrong
      5. Penalty: up to all instructions in flight are thrown away (on par with a memory access)
    3. Importance
      1. Many programs branch a lot
      2. gcc ~20% branches (1 in 5 instructions!)
    4. Strategies
      1. 1st time, no info: assume not taken
      2. Assume branch is consistent
      3. Assume branch is correlated with other branches
      4. Assume branch is a loop w/ consistent count
      5. Assume branch is a function return
    5. Change in model
      1. Predictable branches are cheap
      2. Unpredictable branches are expensive
    6. Change in programming
      1. Avoid unnecessary branching
      2. Try to make branches more predictable (sketch below)
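
A minimal sketch of both points at once (function names are hypothetical): the first loop has a data-dependent branch that mispredicts badly on random input; the second expresses the same selection as arithmetic the compiler will usually lower to a conditional move, leaving only the predictable loop branch. Whether it wins depends on how predictable the data actually is, so measure.

```cpp
#include <cstddef>
#include <cstdint>

// Data-dependent branch: cheap if the data is sorted or skewed
// (predictable), expensive if values straddle the threshold at random.
int64_t sum_above_branchy(const int* a, size_t n, int threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > threshold)
            sum += a[i];
    return sum;
}

// Branchless form: the select usually compiles to a cmov, so the only
// branch left is the loop-back edge, which predicts almost perfectly.
int64_t sum_above_branchless(const int* a, size_t n, int threshold) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > threshold) ? a[i] : 0;
    return sum;
}
```
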
  5. Algorithmic analysis
    1. O() describes asymptotic behavior on large data and assumes constants don't matter
      1. Constants are a function of data access & branching
        • Changing the algorithm changes the constants
      2. Constants can be 1-200
      3. "Best" algorithm may not win for small data
    2. Change in programming
      1. Be aware of data size
      2. Choose algorithm based on data size
      3. Reorganize data to reduce constants
        • Example: implicit tree (sketch below)
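
A minimal sketch of an implicit tree, assuming the outline means a pointer-free binary search tree stored breadth-first in an array (an Eytzinger-style layout): the children of node i live at 2i+1 and 2i+2, so there are no child pointers to dereference, nodes are smaller, and the top levels, touched by every search, stay hot in a handful of cache blocks.

```cpp
#include <cstddef>
#include <vector>

// Fill `tree` from sorted `src` so that an in-order walk of the
// implicit tree (children of i at 2i+1 and 2i+2) visits src in order.
// Assumes tree.size() == src.size().
// Usage: tree.assign(src.size(), 0); size_t pos = 0;
//        build_implicit(src, tree, pos);
void build_implicit(const std::vector<int>& src, std::vector<int>& tree,
                    size_t& pos, size_t i = 0) {
    if (i >= tree.size()) return;
    build_implicit(src, tree, pos, 2 * i + 1);  // left subtree
    tree[i] = src[pos++];
    build_implicit(src, tree, pos, 2 * i + 2);  // right subtree
}

// Search: same O(log n) probes as binary search, but no pointers to
// chase. Returns the tree index of the first element >= key, or
// tree.size() if no such element exists.
size_t implicit_lower_bound(const std::vector<int>& tree, int key) {
    size_t i = 0, best = tree.size();
    while (i < tree.size()) {
        if (tree[i] >= key) { best = i; i = 2 * i + 1; }  // go left
        else                {           i = 2 * i + 2; }  // go right
    }
    return best;
}
```
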