Transformation Processing

  1. Dense tables = good
    1. Select = filter data from big table into dense small table
      1. Create & use condition table if filtered often
      2. Faster to process, but must keep up to date
      3. Think of it like a database index
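
A minimal sketch of a condition table, assuming a hypothetical `health` column and an "alive" filter (neither name comes from these notes). The condition table holds only the indices of rows that pass the filter, and every write goes through one helper so the table stays up to date:

```python
# Hypothetical big table, stored as a single column for brevity.
health = [100, 0, 35, 0, 80]

# Condition table: indices of rows where health > 0.
alive = [i for i, h in enumerate(health) if h > 0]

def set_health(i, value):
    """Write one row and keep the condition table in sync."""
    was_alive = health[i] > 0
    health[i] = value
    now_alive = value > 0
    if was_alive and not now_alive:
        alive.remove(i)      # row no longer passes the filter
    elif now_alive and not was_alive:
        alive.append(i)      # row newly passes the filter

set_health(2, 0)    # row 2 drops out of the condition table
set_health(1, 50)   # row 1 enters it

# Later passes walk only the dense condition table, not the big table.
total = sum(health[i] for i in alive)
```

The cost moves from every read (re-filtering) to every write (maintaining `alive`), which is the same trade a database index makes.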
    2. Join
      1. Trivial
        • Iterate through both: O(n^2)
          for i in table1
              for j in table2
                  if table1.key[i] == table2.key[j]
                      doSomething()
      2. Join by lookup
        • Iterate through one, look up in the other: O(n) * O(lookup)
          for i in table1
              if table2.lookup(table1.key[i])
                  doSomething()
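
One possible way to realize the lookup join, sketched with a Python dict as the lookup structure. The table contents and the `(key, value)` row layout are illustrative assumptions. With a hash-based lookup the expected cost per probe is O(1), so the whole join is expected O(n):

```python
# Hypothetical tables as lists of (key, value) rows.
table1 = [(1, "sword"), (2, "shield"), (4, "bow")]   # (key, item)
table2 = [(2, "Ann"), (4, "Bob"), (5, "Cy")]         # (key, owner)

# Build the lookup structure over one table.
lookup = {key: owner for key, owner in table2}

# Iterate through the other table and probe the lookup for each key.
joined = [(key, item, lookup[key])
          for key, item in table1
          if key in lookup]
```

Here `joined` plays the role of doSomething(): it records one output row per matching key.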
      3. Sorted indices
        1. Can have multiple (like database indices)
        2. Must keep up to date
          1. Update sort before use
          2. Insertion sort on write
          3. Ray casting example
        3. Join like merge sort: O(n) (not counting update cost; assumes unique keys)
          i = 0; j = 0
          while i in table1 and j in table2
              if table1.key[i] == table2.key[j]
                  doSomething()
                  ++i; ++j
              else if table1.key[i] < table2.key[j]
                  ++i
              else
                  ++j
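
The sorted-index join can be sketched as follows, assuming each table is reduced to a sorted list of unique keys (an assumption made for brevity). `bisect.insort` from the standard library stands in for "insertion sort on write": it places a new key at its sorted position so the index never needs a full re-sort.

```python
import bisect

def merge_join(keys1, keys2):
    """Walk two sorted, duplicate-free key lists in step; return matching keys."""
    i = j = 0
    matches = []
    while i < len(keys1) and j < len(keys2):
        if keys1[i] == keys2[j]:
            matches.append(keys1[i])
            i += 1
            j += 1
        elif keys1[i] < keys2[j]:
            i += 1          # keys1 is behind, advance it
        else:
            j += 1          # keys2 is behind, advance it
    return matches

keys1 = [1, 3, 5, 7]
keys2 = [2, 3, 7, 8]

# Keep the sorted index up to date on write.
bisect.insort(keys2, 5)

result = merge_join(keys1, keys2)
```

Each step advances at least one cursor, so the loop runs at most len(keys1) + len(keys2) times.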
      4. Join cache
        • doSomething() = output joined table
        • Process joined table
  2. Transformation
    1. Operations as transformation
      1. Walk through data, perform operation
      2. Walk data, operate, modify data
      3. Walk data, operate, generate
    2. When to generate & process separately
      1. Processing kills table cache
      2. Need to use more than once
        1. Playing sound example
        2. Especially if xform(data) adds, xform(result) removes
      3. Later processing needed
        1. Deferred shading: find closest, then do computation
        2. Delete list: don't delete until done
        3. Sort needed
        4. Multiple cores
          • One generates work, one or more process it
          • Works well with stream processing
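
As an illustration of generating work in one pass and processing it in a later one, here is a sketch of the delete list idea. The `life` column and the swap-remove strategy are assumptions chosen for the example, not something the notes prescribe:

```python
# Hypothetical particle table: remaining lifetime per particle.
life = [3, 1, 2, 1, 5]

# Pass 1: walk the data, operate, and generate a delete list.
# Nothing is removed yet, so indices stay valid during the walk.
delete_list = []
for i in range(len(life)):
    life[i] -= 1
    if life[i] <= 0:
        delete_list.append(i)

# Pass 2: process the delete list once the walk is done.
# Going from the highest index down keeps the remaining indices valid.
for i in reversed(delete_list):
    life[i] = life[-1]   # swap-remove: overwrite with the last row
    life.pop()           # table stays dense; row order is not preserved
```

The same split works across cores: one worker produces `delete_list` (or any other work list) and others consume it.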
      4. Generation of derived data structures
        1. Spatial partitioning
        2. Secondary index
        3. Alternate representation
          1. Scene graph to render lists
          2. Geometric data structure to vertex & face array
          3. Book favors conc-trees
      5. Maps well to map and reduce
        1. Map finds data
        2. Reduce operates on filtered data
          1. Typically on subsets of the data (needs an associative operation)
          2. Prefix sum was reduce operation
          3. Parallelizes to O(lg N) steps
        3. Large-scale parallel map/reduce == Hadoop
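
A small sketch of why an associative reduce needs only about lg(N) steps. The helper name `tree_reduce` and the sample data are hypothetical. Each level combines independent pairs, so all pairs within a level could run in parallel; this version runs them serially and just counts the levels:

```python
from operator import add

def tree_reduce(values, op):
    """Reduce a non-empty sequence pairwise; return (result, number of levels)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Pairs within one level do not depend on each other.
        nxt = [op(level[k], level[k + 1])
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element carries over unchanged
        level = nxt
        steps += 1
    return level[0], steps

data = [3, -1, 4, -1, 5, 9, -2, 6]

# Map/filter stage finds the data of interest.
positives = [x for x in data if x > 0]

# Reduce stage operates on the filtered data.
total, steps = tree_reduce(positives, add)
```

Regrouping the pairs this way is only valid because addition is associative; a non-associative operation would give a different answer than a left-to-right loop.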
  3. Transformation types
    1. A-B Transform
      1. Stateless
      2. A read-only, B write-only
      3. Stream processing
      4. Double-buffer rather than in-place
      5. Process in any order
    2. In-place transform
      1. Stateless
      2. One to one transform
      3. Many to one would introduce ordering/race conditions
    3. Generative transform
      1. Procedure to fill table
      2. Generate based on algorithm
      3. Example: terrain
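
The three transform types above can be sketched side by side. All function names and the sine-based height function are illustrative assumptions:

```python
import math

def generate_terrain(n):
    """Generative: fill a table from a procedure (hypothetical height function)."""
    return [math.sin(0.5 * x) for x in range(n)]

def blur_a_to_b(a, b):
    """A-B: read only from a, write only to b; outputs can be computed in any order."""
    n = len(a)
    for i in range(n):
        left = a[max(i - 1, 0)]
        right = a[min(i + 1, n - 1)]
        b[i] = (left + a[i] + right) / 3.0

def scale_in_place(table, factor):
    """In-place: one-to-one, element i depends only on element i."""
    for i in range(len(table)):
        table[i] *= factor

heights = generate_terrain(8)
scratch = [0.0] * len(heights)

blur_a_to_b(heights, scratch)
heights, scratch = scratch, heights   # double-buffer: swap roles, no copy

scale_in_place(heights, 10.0)
```

The blur reads neighboring elements, so running it in place would mix old and new values depending on iteration order; that is the case where double-buffering pays off. The scale touches only its own element, so in-place is safe.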
    4. Taxonomy
                      # input tables
      # out tables    0            1              N
      0               code         select/exec    join/exec
      1               generative   A-B            join/cache
      N               generative   split          join/split