
Lecture 28, Multiprocessors

Classic problems that require multiprocessors:





Maxwell's Equations


The numerical solution of Maxwell's equations for electromagnetic
fields may use a large four-dimensional array with dimensions
X, Y, Z, T: three spatial dimensions and time.
Relaxation algorithms map well onto a four-dimensional array of
parallel processors.

A 4D 12,288 node supercomputer

A multiprocessor may have distributed memory, shared memory or a
combination of both.



For the distributed memory and the shared memory multiprocessors,
one possible connection, shown as a line above, is an omega
network. The basic building block of an omega network is a switch
with two inputs and two outputs. When a message arrives at this
switch, the first bit is stripped off and the switch is set to
straight through if the bit is '0' on the top input or '1' on the
bottom input; otherwise it is set to cross connected. Note that if
two messages arrive and the exclusive OR of their first bits is
not '1', only one message can pass; the other is blocked.
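
A minimal sketch of this switch decision in C (my own illustration, not
from the lecture; the arguments are assumed to be the stripped-off first
bits of the messages on the top and bottom inputs, or -1 for an idle input):

#include <stdio.h>

/* Decide the setting of one 2x2 omega switch.
   top, bot: first (routing) bit of the message on that input, -1 if idle. */
const char *switch_setting(int top, int bot)
{
    if (top >= 0 && bot >= 0 && (top ^ bot) != 1)
        return "conflict: one message passes, the other is blocked";
    if (top == 0 || bot == 1)   /* top wants the upper output, or bottom wants the lower */
        return "straight through";
    return "cross connected";
}

int main(void)
{
    printf("%s\n", switch_setting(0, 1));   /* straight through */
    printf("%s\n", switch_setting(1, 0));   /* cross connected  */
    printf("%s\n", switch_setting(1, 1));   /* both want the lower output: blocked */
    return 0;
}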



Omega networks for connecting two devices, four devices, or
eight devices, built from this switch, are shown below. The
messages are sent with the most significant bit of the destination
address first.



For 16 devices connected to the same or to a different set of 16
devices, the omega network is built from the primitive switch as:



Note that connecting N devices requires (N/2) log_2(N) switches.
Given a random set of connections of N devices to N devices
through an omega network (mathematically, a permutation),
statistically about 1/2 N of the connections can be made
simultaneously; the others are blocked.
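
To make the routing concrete, here is a small C sketch (my own, assuming
the usual shuffle-exchange wiring) that traces which switch and which
output wire a message occupies at each stage of a 16-device omega network,
consuming the destination address bits most significant bit first:

#include <stdio.h>

#define LOGN 4                 /* n = 4 stages, N = 16 devices */
#define N    (1 << LOGN)

void trace(unsigned src, unsigned dst)
{
    unsigned wire = src;
    printf("route %u -> %u\n", src, dst);
    for (int stage = 0; stage < LOGN; stage++) {
        /* perfect shuffle between stages: rotate the n-bit wire address left */
        wire = ((wire << 1) | (wire >> (LOGN - 1))) & (N - 1);
        unsigned in_port = wire & 1;                     /* top (0) or bottom (1) switch input */
        unsigned bit = (dst >> (LOGN - 1 - stage)) & 1;  /* next destination bit, MSB first */
        wire = (wire & ~1u) | bit;                       /* switch selects upper or lower output */
        printf("  stage %d: switch %2u set %s, leaves on wire %2u\n",
               stage, wire >> 1, in_port == bit ? "straight" : "crossed", wire);
    }
    /* after LOGN stages, wire == dst */
}

int main(void)
{
    trace(5, 12);
    trace(0, 15);
    return 0;
}

Each of the 4 stages contains N/2 = 8 switches, which is where the
(N/2) log_2(N) switch count comes from.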


Then, we can call a CPU-memory pair a node, reduce the drawing
of a node to a dot, and show a few connection topologies
for multiprocessors:



"Ports" is the number of I/O ports the node must have.
"Max path" is the maximum number of hops a message must take
in order to get from one node to the farthest node. A message
may be as small as a Boolean signal or as large as a big
matrix.

The actual interconnect technology for those lines between
the nodes varies greatly. The lowest cost is Gigabit Ethernet,
while the best performance comes from Myrinet and Infiniband.




Now, the change six years later, November 2012:

Top 500 Interconnect        Count   Share (%)
Gigabit Ethernet              159      31.8
Infiniband QDR                106      21.2
Infiniband                     59      11.8
Custom Interconnect            46       9.2
Infiniband FDR                 45       9.0
10G Ethernet                   30       6.0
Cray Gemini interconnect       15       3.0
Proprietary                    11       2.2
Infiniband DDR                  9       1.8
Aries interconnect              4       0.8
Infiniband DDR 4x               4       0.8
XT4 Internal Interconnect       4       0.8
Tofu interconnect               3       0.6
Myrinet 10G                     3       0.6
Infiniband QDR Sun M9           1       0.2

New: 100 Gb/sec Ethernet, Mellanox 100G

One measure of a multiprocessor's communication capability is
"bisection bandwidth": split the processors into two equal groups,
choosing the split with the least connectivity (the worst case),
and measure the maximum bandwidth that may be obtained between the
two groups. For example, bisecting a ring of nodes cuts only two
links, while bisecting a hypercube cuts N/2 links.

Many modern multiprocessors are "clusters." Each node has a CPU,
RAM, a hard drive, and communication hardware. The CPU may be dual
or quad core, and each core is considered a processor that may be
assigned tasks. There is no display, keyboard, sound, or graphics.
The physical form factor is often a "blade" about 2 inches thick,
8 inches high, and 12 inches deep, with slide-in connectors on the back.
A blade may have multiple CPU chips, each with multiple cores.
40 or more blades may be mounted in one rack. Upon power up, each blade
loads its operating system and applications from its local disk.

There is still a deficiency in some multiprocessor and multi-core
operating systems. The OS will move a running program from one
CPU to another rather than leave a long-running program and its
cache contents on one processor. Communication between processes
may actually go out of a communication port and back in through a
communication processor even when the processors are physically
connected to the same RAM, rather than using memory-to-memory
communication.
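
On Linux, a program can ask not to be migrated by pinning itself to a
core. A minimal sketch (Linux-specific GNU extension; core number 0 is
chosen arbitrarily here):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* allow only core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    /* a long-running computation placed here keeps its cache contents on core 0 */
    printf("pinned to core 0\n");
    return 0;
}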

Another classification of multiprocessors is:
SISD Single Instruction Single Data (e.g. old computer)
SIMD Single Instruction Multiple Data (e.g. MasPar, CELL, GPU)
MIMD Multiple Instruction Multiple Data (e.g. cluster)

GPU stands for graphics processing unit, e.g. your graphics
card, which may have as many as 500 cores. Some of these cards
have full IEEE double precision floating point in every core.
There may be groups of cores that are each SIMD, and thus the
collection of groups may operate MIMD.

There are three main problems with massively parallel multiprocessors:
software, software and software.

The operating systems are only marginally useful when a single
program is to be run on a single data set using all the nodes
and all the memory. Today, the OS is almost no help, and the programmer
must plan and program each node and every data transfer between nodes.

The programming languages are of little help. Java threads and
Ada tasks are not guaranteed to run on individual processors.
POSIX threads are difficult to use and control.
MPI and PVM libraries allow the programmer to explicitly allocate
tasks to nodes and control communication, at the expense of significant
programming effort.

Then there are programming classifications:
SPSD Single Program Single Data (Conventional program)
SPMD Single Program Multiple Data (One program with "if" on all processors) 
MPMD Multiple Program Multiple Data (Each processor has a unique program)

MPI, the Message Passing Interface, is one of the SPMD toolkits that make
programming distributed memory multiprocessors practical,
yet still not easy.

There is a single program that runs on all processors, with allowance
for if-then-else code dependent on the processor number. The processor
number may also be used for index and other calculations.
My CMSC 455 lecture on MPI
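
A minimal SPMD skeleton in C with MPI (my own illustration, not from the
CMSC 455 lecture): every process runs the same executable, and
MPI_Comm_rank supplies the processor number used for branching.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this processor's number */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processors */

    if (rank == 0) {
        printf("coordinator: %d processes running\n", size);
    } else {
        /* each worker could use rank to pick its slice of the data */
        printf("worker %d of %d\n", rank, size);
    }
    MPI_Finalize();
    return 0;
}

Compile with mpicc and run with mpirun -np 4 a.out (wrapper names vary
with the MPI installation).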

For shared memory parallel programming, threads are used, with
one thread typically assigned to each CPU.
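
A minimal shared memory sketch using POSIX threads (my own example;
NCPU = 4 is an assumed core count): each thread sums its own slice of a
shared array into its own slot, and the main thread combines the partial
sums after joining.

#include <pthread.h>
#include <stdio.h>

#define NCPU 4                    /* assumed number of cores */
#define N    (1 << 20)

static double a[N];
static double partial[NCPU];      /* one slot per thread, no locking needed */

static void *work(void *arg)
{
    long id = (long)arg;          /* thread number 0..NCPU-1 */
    long lo = id * (N / NCPU), hi = lo + N / NCPU;
    double s = 0.0;
    for (long i = lo; i < hi; i++) s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[NCPU];
    double sum = 0.0;
    for (long i = 0; i < N; i++) a[i] = 1.0;
    for (long id = 0; id < NCPU; id++)
        pthread_create(&t[id], NULL, work, (void *)id);
    for (long id = 0; id < NCPU; id++) {
        pthread_join(t[id], NULL);
        sum += partial[id];
    }
    printf("sum = %.0f\n", sum);  /* prints 1048576 */
    return 0;
}

Compile with cc -pthread.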

Only a small percentage of applications are in the class of
"embarrassingly parallel". Most applications require significant
design effort to obtain significant "speedup".


Yes, Amdahl's law applies to multiprocessors.
Given a multiprocessor with N nodes, the maximum speedup
to be expected compared to a single processor of the same type
as the node, is N. That would imply that 100% of the program
could be made parallel.

Given 32 processors and 50% of the program can be made fully parallel,
25% of the program can use half the processors and the rest of the program
must run sequentially, what is the speedup over one sequential processor?

Time sequentially is 100%                                     100%
                         50%   25%   25%            speedup = ------- = 3.56
Time multiprocessing is  --- + --- + --- = 28.125%            28.125%
                         32    16    1

far from the theoretical maximum of 32!

Note: "fully parallel" means the speedup factor is the number of processors.
      "half the processors" in this case is 32/2 = 16.
      the remaining 25% is sequential, thus factor = 1



Given 32 processors and 99% of the program can be fully parallel,

Time sequentially is 100%                              100%
                         99%   1%            speedup = ------ = 24.4
Time multiprocessing is  --- + -- = 4.1%               4.1%
                         32    1

about 3/4 the theoretical maximum of 32!
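
Both examples follow the same formula: speedup = 1 / sum(fraction_i / factor_i).
A few lines of C (my own check of the arithmetic above):

#include <stdio.h>

/* Amdahl speedup: fractions of the sequential time and their parallel factors */
static double speedup(const double frac[], const double factor[], int n)
{
    double t = 0.0;
    for (int i = 0; i < n; i++) t += frac[i] / factor[i];   /* parallel time */
    return 1.0 / t;
}

int main(void)
{
    double f1[] = {0.50, 0.25, 0.25}, s1[] = {32.0, 16.0, 1.0};
    double f2[] = {0.99, 0.01},       s2[] = {32.0, 1.0};
    printf("32 processors, 50/25/25 case: %.2f\n", speedup(f1, s1, 3));  /* 3.56  */
    printf("32 processors, 99%% parallel: %.2f\n", speedup(f2, s2, 2));  /* 24.43 */
    return 0;
}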


These easy calculations are only considering processing time.
In many programs there is significant communication time to
get the data to the required node and get the results to
the required node. A few programs may require more communication
time than computation time.


Consider a 1024 = 2^10 node multiprocessor.
Add 1,048,576 = 2^20 numbers as fast as possible on this multiprocessor.
Assume no communication cost (very unreasonable)
   step  action
      1  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
      2  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    ...
2^9=512  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums

         (so far fully parallel, now have only 2^19 numbers to add)

2^9+1    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
2^9+2    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
  ...
2^9+2^8  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums

         (so far fully parallel, now have only 2^18 numbers to add)

see the progression:
2^9 + 2^8 + 2^7 + ... 2^2 + 2^1 + 2^0 = 1023 time steps
         and we now have 2^10 partial sums, thus only 2^9 or 512
         processors can be used on the next step

1024    add 2^9 numbers to 2^9 numbers getting 2^9 partial sums
        (using 1/2 the processors)
1025    add 2^8 numbers to 2^8 numbers getting 2^8 partial sums
        (using 1/4 the processors)
 ...    
1033    add 2^0=1 number to 2^0=1 number to get the final sum
        (using 1 processor)

                     sequential time   1,048,575
Thus our speedup is  --------------- = ----------- = 1015
                     parallel time       1033

The percent utilization is 1015/1024 * 100% = 99.12%
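
In MPI this whole computation is usually one call to MPI_Reduce, but an
explicit tree shows where the 10 combining steps come from. A sketch (my
own, assuming the number of processes is a power of two, 1024 to match
the example, and stand-in data of all 1.0):

#include <stdio.h>
#include <mpi.h>

#define LOCAL_N 1024              /* 2^10 numbers per node; 1024 nodes hold 2^20 total */

int main(int argc, char *argv[])
{
    int rank, size;
    double local[LOCAL_N], sum = 0.0, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < LOCAL_N; i++) local[i] = 1.0;   /* stand-in data */

    /* phase 1: every node sums its 1024 local numbers, fully parallel
       (the 1023 steps above) */
    for (int i = 0; i < LOCAL_N; i++) sum += local[i];

    /* phase 2: log2(size) combining steps (the 10 steps above); half of
       the still-active nodes drop out at each step */
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) == step) {
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;                               /* this node is finished */
        } else if (rank % (2 * step) == 0) {
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (rank == 0) printf("total = %.0f\n", sum);        /* 1048576 */
    MPI_Finalize();
    return 0;
}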

Remember: every program has a last, single instruction to execute.
Jack Dongarra, an expert in the field of multiprocessor programming,
says "It just gets worse as you add more processors."


Top 500 multiprocessors:
These have been and are evaluated by the Linpack Benchmark,
heavy-duty numerical computation. This benchmark is close to
"embarrassingly parallel," and thus there is the start of a move
to the Graph 500 Benchmark, which more fully measures the
interconnection capacity of a highly parallel machine.
Graph500

Some history of the top500:
www.top500.org/lists/2006/06
www.top500.org/list/2007/11/100
www.top500.org/lists/2008/11
www.top500.org/list/2015/06
Over 1 million cores, over 12 megawatts of power.
exascale

Gemini interconnect trying to solve the biggest problem

Latest VA Tech Machine

Test your dual-core, quad-core, 8-core, or 12-core machine to be sure
your operating system is assigning threads to different cores.
time_mp2.c
time_mp4.c
time_mp8.c
time_mp12.c
time_mp12_c.out

Here is a graph of Amdahl speedup for an increasing number of processors,
for 50%, 75%, 90%, and 95% parallel execution.
As the curves flatten out, adding more processors or cores becomes useless.



Tabular data

Project part3a hints diff1.png diff2.png


