Stream Processing

  1. What is it?
    1. When processing of elements is independent
    2. Order of processing doesn't matter
      1. Process in cache order (why components can help perf)
      2. Parallel CPU threads (parallel_for)
      3. Why GPUs are fast
  2. CPU parallel
    1. See Intel TBB or similar
    2. One thread per core, issue tasks to task manager
    3. parallel_for
      1. N processors executing M chunks of L iterations (for M*L total)
      2. Automatic or controlled chunking
      3. Need ~1M cycles of work to be worth the scheduling overhead (see the TBB sketch after this list)
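
      A minimal sketch of this pattern with Intel TBB (oneTBB); the vector, the
      scale operation, and the grain size of 4096 are illustrative assumptions,
      not from the notes:

        #include <cstddef>
        #include <vector>
        #include <tbb/blocked_range.h>
        #include <tbb/parallel_for.h>

        // Scale every element independently: order doesn't matter and chunks
        // never communicate, so this is a pure streaming workload.
        void scale_all(std::vector<float>& data, float s)
        {
            tbb::parallel_for(
                // Explicit grain size controls chunking; omit it for automatic chunking.
                tbb::blocked_range<std::size_t>(0, data.size(), /*grainsize=*/4096),
                [&](const tbb::blocked_range<std::size_t>& r) {
                    // Each task gets one contiguous chunk -> good cache locality.
                    for (std::size_t i = r.begin(); i != r.end(); ++i)
                        data[i] *= s;
                });
        }
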
  3. GPU parallel
    1. N "Streaming Multiprocessors" (SM) of M blocks w/ L SIMD cores executing K interleaved threads (making L*K thread effectively SIMD "warps")
    2. NVIDIA "Maxwell"
      1. 16 SMs
      2. 4 blocks/SM
      3. 32 cores/block
      4. 1-32 threads/core (up to 1024 threads/warp)
      5. == 2048 cores; with up to 32 interleaved threads per core, up to 64 Ki threads in flight
    3. Interleaving hides (some) memory latency
  4. Streaming parallelism
    1. aka SPMD = Single Program Multiple Data
    2. What slows down parallel processing?
      1. Communication
        1. Communication requires synchronization/blocking between processes
        2. Often reduces speed to slowest processor
        3. Streams are independent so don't need to communicate
      2. Locks, Mutexes & Semaphores
        1. Synchronization primitives control access to data
        2. Block until access is granted
        3. Streams are independent so don't need to synchronize
    3. Complications
      1. Can share read data, but avoid writes to the same cache block (false sharing; see the padding sketch after this list)
        1. GPU SM shares cache, so still want locality
        2. CPU chunks within a thread still want cache locality
      2. Variable result production
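
      A sketch of the shared-cache-block problem (false sharing): per-thread
      counters packed next to each other land in one cache line, so independent
      writes from different cores still invalidate each other. The 64-byte line
      size and the counter structs are assumptions for illustration:

        // Bad: adjacent counters share a cache line; a write by one thread
        // invalidates the line in every other thread's cache.
        struct PackedCounters {
            long count[8];            // thread t increments count[t]
        };

        // Better: pad/align each counter to its own (assumed 64-byte) line.
        struct alignas(64) PaddedCounter {
            long count;
            char pad[64 - sizeof(long)];
        };

        PaddedCounter per_thread[8];  // thread t touches only per_thread[t]
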
    4. Prefix sum / scan
      1. Each thread gets the sum of the elements from threads with lower TID (an exclusive scan)
      2. Example (the Total column is verified by the snippet after the table):
        Thread  Data  Prefix             Total  Grouped
        0       4     0                  0      0
        1       6     4                  4      4
        2       6     4+6                10     (4+6)
        3       2     4+6+6              16     (4+6)+6
        4       7     4+6+6+2            18     ((4+6)+(6+2))
        5       6     4+6+6+2+7          25     ((4+6)+(6+2))+7
        6       5     4+6+6+2+7+6        31     ((4+6)+(6+2))+(7+6)
        7       5     4+6+6+2+7+6+5      36     ((4+6)+(6+2))+((7+6)+5)
        --      --    4+6+6+2+7+6+5+5    41     ((4+6)+(6+2))+((7+6)+(5+5))
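
        The Total column is an exclusive scan of the Data column; a minimal
        serial check (values taken from the table above):

          #include <array>
          #include <cstdio>
          #include <numeric>

          int main()
          {
              std::array<int, 8> data{4, 6, 6, 2, 7, 6, 5, 5};   // Data column
              std::array<int, 8> start{};                        // Total column
              // start[t] = sum of data[i] for all i < t (exclusive scan).
              std::exclusive_scan(data.begin(), data.end(), start.begin(), 0);
              for (int t = 0; t < 8; ++t)
                  std::printf("thread %d starts at %d\n", t, start[t]);
              // Grand total = 36 + 5 = 41, matching the last table row.
          }
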
      3. Use for variable data production
        1. 1 pass: figure out how much data each thread will produce
        2. 2 lg(P) passes: parallel scan to find output start point per thread
        3. 1 pass: generate data, each thread writing at its own start point (see the compaction sketch after this list)
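
        A sketch of the whole pattern, compacting the even values out of an
        input array; the chunk-per-thread partitioning is an assumption for
        illustration, and the serial std::exclusive_scan stands in for the
        2 lg(P) parallel scan passes:

          #include <algorithm>
          #include <cstddef>
          #include <numeric>
          #include <vector>

          // Keep only the even values; "thread" t owns one contiguous chunk.
          std::vector<int> compact_even(const std::vector<int>& in, std::size_t nthreads)
          {
              std::size_t chunk = (in.size() + nthreads - 1) / nthreads;
              std::vector<std::size_t> count(nthreads, 0), start(nthreads, 0);
              auto hi = [&](std::size_t t) { return std::min(in.size(), (t + 1) * chunk); };

              // Pass 1: each thread counts how much output it will produce.
              for (std::size_t t = 0; t < nthreads; ++t)        // parallel in practice
                  for (std::size_t i = t * chunk; i < hi(t); ++i)
                      if (in[i] % 2 == 0) ++count[t];

              // Pass 2: exclusive scan of the counts gives each thread's output offset.
              std::exclusive_scan(count.begin(), count.end(), start.begin(), std::size_t{0});

              // Pass 3: each thread writes from its own offset, no synchronization needed.
              std::vector<int> out(start.back() + count.back());
              for (std::size_t t = 0; t < nthreads; ++t)        // parallel in practice
              {
                  std::size_t o = start[t];
                  for (std::size_t i = t * chunk; i < hi(t); ++i)
                      if (in[i] % 2 == 0) out[o++] = in[i];
              }
              return out;
          }
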
      4. Segmented/partitioned scan
        • Extra per-element restart boolean: the running sum resets wherever the flag is set (sketch below)
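
        A serial sketch of an inclusive segmented scan; the flag resets the
        running sum at the start of each segment (names are illustrative):

          #include <cstddef>
          #include <vector>

          std::vector<int> segmented_scan(const std::vector<int>& data,
                                          const std::vector<bool>& restart)
          {
              std::vector<int> out(data.size());
              int running = 0;
              for (std::size_t i = 0; i < data.size(); ++i) {
                  if (restart[i]) running = 0;   // segment boundary: start over
                  running += data[i];
                  out[i] = running;
              }
              return out;
          }
          // e.g. data    = {4, 6, 6, 2, 7, 6, 5, 5}
          //      restart = {1, 0, 0, 0, 1, 0, 0, 0}
          //      out     = {4,10,16,18, 7,13,18,23}
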
    5. Producer/Consumer
      1. Queue
      2. Ring buffer
      3. Many producers? Many consumers? (the sketch after this list assumes a single producer and a single consumer)
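
      A sketch of a single-producer/single-consumer ring buffer built on two
      atomic indices; supporting many producers or many consumers needs CAS
      loops or locks on the contended end. The class name and power-of-two
      capacity are illustrative assumptions:

        #include <atomic>
        #include <cstddef>

        // Single-producer / single-consumer ring buffer; N must be a power of
        // two so the monotonically increasing indices wrap cleanly.
        template <typename T, std::size_t N>
        class SpscRing {
            T buf[N];
            std::atomic<std::size_t> head{0};  // advanced only by the consumer
            std::atomic<std::size_t> tail{0};  // advanced only by the producer
        public:
            bool push(const T& v) {            // producer side
                std::size_t t = tail.load(std::memory_order_relaxed);
                if (t - head.load(std::memory_order_acquire) == N)
                    return false;              // full
                buf[t % N] = v;
                tail.store(t + 1, std::memory_order_release);  // publish the element
                return true;
            }
            bool pop(T& v) {                   // consumer side
                std::size_t h = head.load(std::memory_order_relaxed);
                if (h == tail.load(std::memory_order_acquire))
                    return false;              // empty
                v = buf[h % N];
                head.store(h + 1, std::memory_order_release); // free the slot
                return true;
            }
        };
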
    6. Lockless/Lock-free data structures
      1. Individual threads may still stall and retry, but some thread always makes progress (compare to wait-free, where every thread makes progress)
      2. Read/modify/write atomics (e.g. compare and swap)
      3. Memory fence / ordering (the compiler needs to know so it doesn't reorder across it; see the CAS sketch below)
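
      A sketch of the read/modify/write pattern: a lock-free (Treiber-style)
      stack push using compare-and-swap in a retry loop. The release ordering
      is the fence that keeps the compiler/CPU from reordering the node write
      past the publish. The Node type is illustrative, and pop is omitted
      because safe memory reclamation is a separate problem:

        #include <atomic>

        struct Node {
            int   value;
            Node* next;
        };

        std::atomic<Node*> top{nullptr};

        // Lock-free push: no thread ever holds a lock, but a thread may have
        // to retry if another thread's CAS wins the race (lock-free, not wait-free).
        void push(Node* n)
        {
            n->next = top.load(std::memory_order_relaxed);
            // If top still equals n->next, swap in n; otherwise n->next is
            // refreshed to the current top and the loop retries.
            while (!top.compare_exchange_weak(n->next, n,
                                              std::memory_order_release,
                                              std::memory_order_relaxed))
            { /* retry */ }
        }
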