<- previous    index    next ->

Lecture 28, Multiprocessors

Classic problems that require multiprocessors:





Maxwell's Equations


The numerical solution of Maxwell's equations for electromagnetic
fields may use a large four-dimensional array with dimensions
X, Y, Z, and T: three spatial dimensions and time.
Relaxation algorithms map well onto a four-dimensional array of
parallel processors.
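As a much simplified illustration, here is a sketch of one Jacobi
relaxation sweep on a 2-D grid. The function name and grid layout are
my own, not from any particular solver; the point is that each cell
needs only its four neighbors, so each node of a processor grid can
update its own sub-grid and exchange only edge values.

```python
def jacobi_sweep(grid):
    """One relaxation sweep: replace each interior cell by the
    average of its four neighbors; boundary cells stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # copy, boundary unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1])
    return new

# tiny demo: interior cell surrounded by 1.0 boundary values
g = [[1.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
center = jacobi_sweep(g)[1][1]              # average of four 1.0 values
```

Repeating the sweep until the grid stops changing gives the relaxation
solution; each sweep over a node's sub-grid runs in parallel with the
sweeps on every other node.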

A 4D 12,288 node supercomputer

A multiprocessor may have distributed memory, shared memory or a
combination of both.



For the distributed memory and the shared memory multiprocessors,
one possible connection, shown as a line above, is an
omega network. The basic building block of an omega network is
a switch with two inputs and two outputs. When a message arrives
at this switch, the first bit is stripped off and the switch is
set straight through if the bit is '0' on the top input or
'1' on the bottom input, and cross connected otherwise. Note that
if two messages arrive and the exclusive or of their first bits
is not '1', both want the same output, so only one message can
pass and the other is blocked.



Omega networks for connecting two devices, four devices or
eight devices, built from this switch, are shown below. Messages
are sent with the most significant bit of the destination
first.



For 16 devices connected to the same, or a different, set of 16
devices, the omega network is built from the primitive switch as:



Note that connecting N devices requires (N/2) log_2(N) switches:
log_2(N) stages of N/2 switches each.
Given a set of random connections of N devices to N devices
through an omega network (mathematically, a permutation),
statistically about N/2 of the connections can be made
simultaneously.
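The self-routing rule above can be sketched in a small simulation.
This is an assumed model (a rotate-left perfect shuffle between
stages, destination bits consumed most significant bit first); the
function name omega_route is my own:

```python
def omega_route(dest):
    """dest[s] is the destination for the message entering at
    source s.  Route all messages through log2(N) switch stages;
    return the sorted list of sources that get through unblocked."""
    N = len(dest)
    stages = N.bit_length() - 1           # log2(N) stages of switches
    alive = {s: s for s in range(N)}      # source -> wire it occupies
    for stage in range(stages):
        claimed = {}                      # output wire -> first claimant
        for s, wire in alive.items():
            # perfect shuffle between stages: rotate wire number left
            shuffled = ((wire << 1) | (wire >> (stages - 1))) & (N - 1)
            # switch output chosen by the next destination bit
            bit = (dest[s] >> (stages - 1 - stage)) & 1
            claimed.setdefault((shuffled & ~1) | bit, s)
        alive = {s: wire for wire, s in claimed.items()}
    return sorted(alive)
```

The identity permutation and cyclic shifts route without blocking;
two messages aimed at the same destination collide, and one of them
is dropped in this single-pass model (a real network would retry).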


Then, we can call a CPU-memory pair a node, reduce the drawing
of a node to a dot, and show a few connection topologies
for multiprocessors.



"Ports" is the number of I/O ports the node must have.
"Max path" is the maximum number of hops a message must take
in order to get from one node to the farthest node. A message
may be as small as a Boolean signal or as large as a big
matrix.

The actual interconnect technology for those lines between
the nodes varies greatly. The lowest cost is Gigabit Ethernet,
while the best performance comes from Myrinet and InfiniBand.



One measure of a multiprocessor's communication capability is
"bisection bandwidth": split the processors into two equal
groups and measure the total bandwidth available between the
groups; the minimum over all such equal splits is the bisection
bandwidth.
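For a few common topologies the bisection width (the number of links
cut by the worst-case equal split) has a simple closed form; this
sketch tabulates them, leaving the per-link bandwidth as a multiplier.
The function and its topology names are illustrative:

```python
import math

def bisection_width(topology, n):
    """Links cut by the bisecting split of an n-node network;
    multiply by per-link bandwidth for bisection bandwidth."""
    if topology == "ring":
        return 2                      # any equal split cuts 2 links
    if topology == "2d-torus":        # n = k*k nodes in a k-by-k torus
        k = math.isqrt(n)
        return 2 * k                  # k links cut, doubled by wraparound
    if topology == "hypercube":
        return n // 2                 # one dimension's worth of links
    raise ValueError("unknown topology: " + topology)
```

A fully connected network, by contrast, has bisection width (N/2)^2,
which is why it is never built for large N.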

Many modern multiprocessors are "clusters." Each node has a CPU,
RAM, a hard drive and communication hardware. The CPU may be dual
or quad core, and each core is considered a processor that may be
assigned tasks. There is no display, keyboard, sound or graphics.
The physical form factor is often a "blade" about 2 inches thick,
8 inches high and 12 inches deep with slide-in connectors on the back.
A blade may have multiple CPU chips, each with multiple cores.
40 or more blades may be mounted in one rack. Upon power up, each
blade loads its operating system and applications from its local disk.

There is still a deficiency in some multiprocessor and multi-core
operating systems. The OS will move a running program from one
CPU to another rather than leave a long running program and its
cache contents on one processor. Communication between processes
may actually go out of a communication port and back in through a
communication processor even when the processors are physically
connected to the same RAM, rather than using memory-to-memory
communication.

Another classification of multiprocessors is:
SISD Single Instruction Single Data (e.g. old computer)
SIMD Single Instruction Multiple Data (e.g. MasPar)
MIMD Multiple Instruction Multiple Data (e.g. cluster)

There are three main problems with massively parallel multiprocessors:
software, software and software.

The operating systems are of only marginal use when a single program
is to be run on a single data set using all the nodes and all the
memory. Today the OS is almost no help: the programmer must plan
and program each node and every data transfer between nodes.

The programming languages are of little help. Java threads and
Ada tasks are not guaranteed to run on individual processors.
POSIX threads are difficult to use and control.
MPI and PVM libraries allow the programmer to explicitly allocate
tasks to nodes and control communication, at the expense of
significant programming effort.

Then there are programming classifications:
SPSD Single Program Single Data (conventional program)
SPMD Single Program Multiple Data (one program with "if" on all processors)
MPMD Multiple Program Multiple Data (each processor has a unique program)

MPI, the Message Passing Interface, is one of the SPMD toolkits that
make programming multiprocessors practical, though still not easy.
A single program runs on all processors, with if-then-else code
dependent on the processor number. The processor number may also
be used for index and other calculations.
My CMSC 455 lecture on MPI
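The SPMD style can be sketched without MPI at all: one program body
whose behavior branches on the processor number ("rank"). Real MPI
runs this body once per processor; here the ranks are simulated in an
ordinary loop, and the name spmd_body is my own, not part of MPI:

```python
def spmd_body(rank, size, data):
    """One program text for every rank; each rank uses its rank
    number to pick its own slice of the data to sum."""
    chunk = len(data) // size
    lo = rank * chunk
    hi = len(data) if rank == size - 1 else lo + chunk
    return sum(data[lo:hi])           # this rank's partial sum

data = list(range(100))
# simulate 4 ranks running the same body
partials = [spmd_body(rank, 4, data) for rank in range(4)]
total = sum(partials)                 # stands in for an MPI reduce
```

With real MPI the loop disappears: every processor calls spmd_body
once with its own rank, and the final sum is collected with a
reduce operation instead of the list comprehension.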

Only a small percentage of applications are in the class of
"embarrassingly parallel". Most applications require significant
design effort to obtain significant "speedup".


Yes, Amdahl's law applies to multiprocessors.
Given a multiprocessor with N nodes, the maximum speedup
to be expected, compared to a single processor of the same type
as a node, is N. Reaching that maximum would require that 100%
of the program be made parallel.

Given 32 processors and 50% of the program can be made fully parallel,
25% of the program can use half the processors and the rest of the program
must run sequentially, what is the speedup over one sequential processor?

Time sequentially is 100%                                     100%
                         50%   25%   25%            speedup = ------ = 3.55
Time multiprocessing is  --- + --- + --- = 28.125%            28.125%
                         32    16    1

far from the theoretical maximum of 32!

Note: "fully parallel" means the speedup factor is the number of processors.
      "half the processors" in this case is 32/2 = 16.
      the remaining 25% is sequential, thus factor = 1



Given 32 processors and 99% of the program can be fully parallel,

Time sequentially is 100%                              100%
                         99%   1%            speedup = ------ = 24.4
Time multiprocessing is  --- + -- = 4.1%               4.1%
                         32    1

about 3/4 the theoretical maximum of 32!
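Both worked examples come from the same formula: total parallel time
is the sum of each portion's fraction divided by the processors it
can use, and speedup is the reciprocal. A small helper, with
illustrative names:

```python
def amdahl_speedup(portions):
    """portions: list of (fraction_of_work, processors_used) pairs;
    fractions must sum to 1.  Returns speedup over one processor."""
    parallel_time = sum(f / p for f, p in portions)
    return 1.0 / parallel_time

# 50% on 32 processors, 25% on 16, 25% sequential -> about 3.55
ex1 = amdahl_speedup([(0.50, 32), (0.25, 16), (0.25, 1)])
# 99% on 32 processors, 1% sequential -> about 24.4
ex2 = amdahl_speedup([(0.99, 32), (0.01, 1)])
```

Note how ex1 is dominated by the 25% sequential term: 25/1 dwarfs
the two parallel terms, which is the whole point of Amdahl's law.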


These easy calculations consider only processing time.
In many programs there is significant communication time to
get the data to the required node and to get the results to
the required node. A few programs may require more communication
time than computation time.


Consider a 1024 = 2^10 node multiprocessor.
Add 1,048,576 = 2^20 numbers as fast as possible on this multiprocessor.
Assume no communication cost (very unreasonable)
   step  action
      1  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
      2  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    ...
2^9=512  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums

         (so far fully parallel, now have only 2^19 numbers to add)

2^9+1    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
2^9+2    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
  ...
2^9+2^8  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums

         (so far fully parallel, now have only 2^18 numbers to add)

see the progression:
2^9 + 2^8 + 2^7 + ... 2^2 + 2^1 + 2^0 = 1023 time steps
         and we now have 2^10 partial sums, thus only 2^9 or 512
         processors can be used on the next step

1024    add 2^9 numbers to 2^9 numbers getting 2^9 partial sums
        (using 1/2 the processors)
1025    add 2^8 numbers to 2^8 numbers getting 2^8 partial sums
        (using 1/4 the processors)
 ...    
1033    add 2^0=1 number to 2^0=1 number to get the final sum
        (using 1 processor)

                     sequential time   1,048,575
Thus our speedup is  --------------- = ----------- = 1015
                     parallel time       1033

The percent utilization is 1015/1024 * 100% = 99.12%
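The 1033-step count can be checked by simulation: in each time step
every processor can perform one addition, and each addition turns two
numbers into one. The names below are my own:

```python
def reduction_steps(numbers, processors):
    """Time steps needed to sum `numbers` values when each of
    `processors` can do one addition per step."""
    steps = 0
    while numbers > 1:
        adds = min(numbers // 2, processors)  # pairings this step
        numbers -= adds                       # each add removes one value
        steps += 1
    return steps

steps = reduction_steps(2**20, 2**10)   # the 1033 steps derived above
sequential = 2**20 - 1                  # additions a single CPU needs
```

The same function with processors=1 gives the sequential count of
2^20 - 1 additions, so the speedup of about 1015 falls out directly.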

Remember: Every program has a last, single, instruction to execute.

Jack Dongarra, an expert in the field of multiprocessor programming
says "It just gets worse as you add more processors."


Top 500 multiprocessors:
www.top500.org/lists/2006/06
www.top500.org/lists/2006/11
