Stream Processing

  1. What is it?
    1. When processing of elements is independent
    2. Order of processing doesn't matter
      1. Process in cache order (why components can help perf)
      2. Parallel CPU threads (parallel_for)
      3. Why GPUs are fast
  2. CPU parallel
    1. See Intel TBB or similar
    2. One thread per core, issue tasks to task manager
    3. parallel_for
      1. N processors executing M chunks of L iterations (for M*L total)
      2. Automatic or controlled chunking
      3. Need ~1M cycles of work to be worth the scheduling overhead (see the TBB sketch after this list)
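
      A minimal sketch of this pattern with Intel TBB (oneTBB); the vector, the
      scale operation, and the grain size of 4096 are illustrative assumptions,
      not from the notes:

        #include <cstddef>
        #include <vector>
        #include <tbb/blocked_range.h>
        #include <tbb/parallel_for.h>

        // Scale every element independently: order doesn't matter and chunks
        // never communicate, so this is a pure streaming workload.
        void scale_all(std::vector<float>& data, float s)
        {
            tbb::parallel_for(
                // Explicit grain size controls chunking; omit it for automatic chunking.
                tbb::blocked_range<std::size_t>(0, data.size(), /*grainsize=*/4096),
                [&](const tbb::blocked_range<std::size_t>& r) {
                    // Each task gets one contiguous chunk -> good cache locality.
                    for (std::size_t i = r.begin(); i != r.end(); ++i)
                        data[i] *= s;
                });
        }
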
  3. GPU parallel
    1. N "Streaming Multiprocessors" (SM) of M blocks w/ L SIMD cores executing K interleaved threads (making L*K thread effectively SIMD "warps")
    2. NVIDIA "Maxwell"
      1. 16 SMs
      2. 4 blocks/SM
      3. 32 cores/block
      4. 1-32 threads/core (up to 1024 threads/warp)
      5. == 2048 cores; with up to 32 interleaved threads per core, up to 64 Ki threads in flight
    3. Interleaving hides (some) memory latency
  4. Streaming parallelism
    1. aka SPMD = Single Program Multiple Data
    2. What slows down parallel processing?
      1. Communication
        1. Communication requires synchronization/blocking between processes
        2. Often reduces speed to slowest processor
        3. Streams are independent so don't need to communicate
      2. Locks, Mutexes & Semaphores
        1. Synchronization primitives control access to data
        2. Block until access is granted
        3. Streams are independent so don't need to synchronize
    3. Complications
      1. Can share read data, but avoid writes to the same cache block (false sharing; see the padding sketch after this list)
        1. GPU SM shares cache, so still want locality
        2. CPU chunks within a thread still want cache locality
      2. Variable result production
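
      A sketch of the shared-cache-block problem (false sharing): per-thread
      counters packed next to each other land in one cache line, so independent
      writes from different cores still invalidate each other. The 64-byte line
      size and the counter structs are assumptions for illustration:

        // Bad: adjacent counters share a cache line; a write by one thread
        // invalidates the line in every other thread's cache.
        struct PackedCounters {
            long count[8];            // thread t increments count[t]
        };

        // Better: pad/align each counter to its own (assumed 64-byte) line.
        struct alignas(64) PaddedCounter {
            long count;
            char pad[64 - sizeof(long)];
        };

        PaddedCounter per_thread[8];  // thread t touches only per_thread[t]
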
    4. Prefix sum / scan
      1. Each thread gets the sum of the elements from threads with lower TID (an exclusive scan)
      2. Example (the Total column is verified by the snippet after the table):
        Thread  Data  Prefix             Total  Grouped
        0       4     0                  0      0
        1       6     4                  4      4
        2       6     4+6                10     (4+6)
        3       2     4+6+6              16     (4+6)+6
        4       7     4+6+6+2            18     ((4+6)+(6+2))
        5       6     4+6+6+2+7          25     ((4+6)+(6+2))+7
        6       5     4+6+6+2+7+6        31     ((4+6)+(6+2))+(7+6)
        7       5     4+6+6+2+7+6+5      36     ((4+6)+(6+2))+((7+6)+5)
        --      --    4+6+6+2+7+6+5+5    41     ((4+6)+(6+2))+((7+6)+(5+5))
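
        The Total column is an exclusive scan of the Data column; a minimal
        serial check (values taken from the table above):

          #include <array>
          #include <cstdio>
          #include <numeric>

          int main()
          {
              std::array<int, 8> data{4, 6, 6, 2, 7, 6, 5, 5};   // Data column
              std::array<int, 8> start{};                        // Total column
              // start[t] = sum of data[i] for all i < t (exclusive scan).
              std::exclusive_scan(data.begin(), data.end(), start.begin(), 0);
              for (int t = 0; t < 8; ++t)
                  std::printf("thread %d starts at %d\n", t, start[t]);
              // Grand total = 36 + 5 = 41, matching the last table row.
          }
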
      3. Use for variable data production
        1. 1 pass: figure out how much data each thread will produce
        2. 2 lg(P) passes: parallel scan to find output start point per thread
        3. 1 pass: generate data, each thread writing at its own start point (see the compaction sketch after this list)
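
        A sketch of the whole pattern, compacting the even values out of an
        input array; the chunk-per-thread partitioning is an assumption for
        illustration, and the serial std::exclusive_scan stands in for the
        2 lg(P) parallel scan passes:

          #include <algorithm>
          #include <cstddef>
          #include <numeric>
          #include <vector>

          // Keep only the even values; "thread" t owns one contiguous chunk.
          std::vector<int> compact_even(const std::vector<int>& in, std::size_t nthreads)
          {
              std::size_t chunk = (in.size() + nthreads - 1) / nthreads;
              std::vector<std::size_t> count(nthreads, 0), start(nthreads, 0);
              auto hi = [&](std::size_t t) { return std::min(in.size(), (t + 1) * chunk); };

              // Pass 1: each thread counts how much output it will produce.
              for (std::size_t t = 0; t < nthreads; ++t)        // parallel in practice
                  for (std::size_t i = t * chunk; i < hi(t); ++i)
                      if (in[i] % 2 == 0) ++count[t];

              // Pass 2: exclusive scan of the counts gives each thread's output offset.
              std::exclusive_scan(count.begin(), count.end(), start.begin(), std::size_t{0});

              // Pass 3: each thread writes from its own offset, no synchronization needed.
              std::vector<int> out(start.back() + count.back());
              for (std::size_t t = 0; t < nthreads; ++t)        // parallel in practice
              {
                  std::size_t o = start[t];
                  for (std::size_t i = t * chunk; i < hi(t); ++i)
                      if (in[i] % 2 == 0) out[o++] = in[i];
              }
              return out;
          }
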
      4. Segmented/partitioned scan
        • Extra per-element restart boolean: the running sum resets wherever the flag is set (sketch below)
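
        A serial sketch of an inclusive segmented scan; the flag resets the
        running sum at the start of each segment (names are illustrative):

          #include <cstddef>
          #include <vector>

          std::vector<int> segmented_scan(const std::vector<int>& data,
                                          const std::vector<bool>& restart)
          {
              std::vector<int> out(data.size());
              int running = 0;
              for (std::size_t i = 0; i < data.size(); ++i) {
                  if (restart[i]) running = 0;   // segment boundary: start over
                  running += data[i];
                  out[i] = running;
              }
              return out;
          }
          // e.g. data    = {4, 6, 6, 2, 7, 6, 5, 5}
          //      restart = {1, 0, 0, 0, 1, 0, 0, 0}
          //      out     = {4,10,16,18, 7,13,18,23}
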
    5. Producer/Consumer
      1. Queue
      2. Ring buffer
      3. Many producers? Many consumers? (the sketch after this list assumes a single producer and a single consumer)
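
      A sketch of a single-producer/single-consumer ring buffer built on two
      atomic indices; supporting many producers or many consumers needs CAS
      loops or locks on the contended end. The class name and power-of-two
      capacity are illustrative assumptions:

        #include <atomic>
        #include <cstddef>

        // Single-producer / single-consumer ring buffer; N must be a power of
        // two so the monotonically increasing indices wrap cleanly.
        template <typename T, std::size_t N>
        class SpscRing {
            T buf[N];
            std::atomic<std::size_t> head{0};  // advanced only by the consumer
            std::atomic<std::size_t> tail{0};  // advanced only by the producer
        public:
            bool push(const T& v) {            // producer side
                std::size_t t = tail.load(std::memory_order_relaxed);
                if (t - head.load(std::memory_order_acquire) == N)
                    return false;              // full
                buf[t % N] = v;
                tail.store(t + 1, std::memory_order_release);  // publish the element
                return true;
            }
            bool pop(T& v) {                   // consumer side
                std::size_t h = head.load(std::memory_order_relaxed);
                if (h == tail.load(std::memory_order_acquire))
                    return false;              // empty
                v = buf[h % N];
                head.store(h + 1, std::memory_order_release); // free the slot
                return true;
            }
        };
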
    6. Lockless/Lock-free data structures
      1. Individual threads may still stall and retry, but some thread always makes progress (compare to wait-free, where every thread makes progress)
      2. Read/modify/write atomics (e.g. compare and swap)
      3. Memory fence / ordering (the compiler needs to know so it doesn't reorder across it; see the CAS sketch below)
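
      A sketch of the read/modify/write pattern: a lock-free (Treiber-style)
      stack push using compare-and-swap in a retry loop. The release ordering
      is the fence that keeps the compiler/CPU from reordering the node write
      past the publish. The Node type is illustrative, and pop is omitted
      because safe memory reclamation is a separate problem:

        #include <atomic>

        struct Node {
            int   value;
            Node* next;
        };

        std::atomic<Node*> top{nullptr};

        // Lock-free push: no thread ever holds a lock, but a thread may have
        // to retry if another thread's CAS wins the race (lock-free, not wait-free).
        void push(Node* n)
        {
            n->next = top.load(std::memory_order_relaxed);
            // If top still equals n->next, swap in n; otherwise n->next is
            // refreshed to the current top and the loop retries.
            while (!top.compare_exchange_weak(n->next, n,
                                              std::memory_order_release,
                                              std::memory_order_relaxed))
            { /* retry */ }
        }
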