Lecture 28, Multiprocessors

Classic problems that require multiprocessors:

Maxwell's Equations

The numerical solution of Maxwell's Equations for electromagnetic
fields may use a large four-dimensional array with dimensions
X, Y, Z, T: three spatial dimensions and time.
Relaxation algorithms map well to a four-dimensional array of
parallel processors.

A 4D 12,288 node supercomputer

A multiprocessor may have distributed memory, shared memory or a
combination of both.

For the distributed memory and the shared memory multiprocessors,
one possible connection, shown as a line above, is to use an
omega network. The basic building block of an omega network is
a switch with two inputs and two outputs. When a message arrives
at this switch, the first bit is stripped off and the switch is
set to straight through if the bit is '0' on the top input or
'1' on the bottom input; otherwise it is cross connected. Note that
if two messages arrive and the exclusive or of their first bits is
not '1', both need the same output, so only one message can pass
and the other is blocked.
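
The switch rule can be sketched in a few lines of Python. This is only
an illustration: the function name switch_2x2 and its return format are
made up for this sketch, and the choice that the top input wins when two
messages collide is an assumption (the notes do not say which message
is blocked).

```python
def switch_2x2(top_bit=None, bottom_bit=None):
    """Model one 2x2 omega-network switch.

    Each argument is the leading routing bit (0 or 1) of the message on
    that input, or None if the input is idle. A bit of 0 asks for the top
    output, a bit of 1 asks for the bottom output.

    Returns (setting, outputs, blocked):
      setting  "straight" or "cross"
      outputs  dict mapping output port -> input port the message came from
      blocked  list of input ports whose message could not pass
    """
    if top_bit is None and bottom_bit is None:
        return "straight", {}, []          # idle switch, setting is arbitrary

    # The setting follows the rule in the notes. When both inputs are
    # busy, the top input decides (assumption: top input has priority).
    if top_bit is not None:
        setting = "straight" if top_bit == 0 else "cross"
    else:
        setting = "straight" if bottom_bit == 1 else "cross"

    outputs = {}
    blocked = []
    if top_bit is not None:
        outputs["top" if top_bit == 0 else "bottom"] = "top"
    if bottom_bit is not None:
        wanted = "top" if bottom_bit == 0 else "bottom"
        if wanted in outputs:
            blocked.append("bottom")       # same output requested: conflict
        else:
            outputs[wanted] = "bottom"
    return setting, outputs, blocked


print(switch_2x2(0, 1))   # bits differ: both messages pass, straight through
print(switch_2x2(1, 0))   # bits differ: both messages pass, cross connected
print(switch_2x2(0, 0))   # bits equal: the bottom message is blocked
```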

Omega networks for connecting two devices, four devices or
eight devices, built from this switch, are shown below. The
messages are sent with the most significant bit of the destination first.

For 16 devices connected to the same or different 16 devices,
the omega network is built from the primitive switch as:

Note that connecting N devices requires (N/2) log_2(N) switches:
log_2(N) stages of N/2 switches each.
A random set of connections of N devices to N devices through
an omega network is mathematically a permutation, and
statistically about N/2 of the connections may be made simultaneously.
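
A minimal simulation sketch of routing through an omega network, for
experimenting with how many connections of a random permutation get
through. Assumptions made for this sketch, not taken from the notes:
each stage is preceded by a perfect shuffle of the lines, a blocked
message is simply dropped, and the lower-numbered source wins a
conflict. The names route_omega and average_delivered are hypothetical.

```python
import random


def route_omega(n_bits, requests):
    """Route messages through an omega network with 2**n_bits lines.

    requests maps source line -> destination line. Returns the set of
    sources whose message reaches its destination in one pass.
    """
    mask = (1 << n_bits) - 1
    position = {src: src for src in requests}   # current line of each live message
    for stage in range(n_bits):
        taken = {}                               # switch output line -> source using it
        for src in sorted(position):             # lower-numbered source wins (assumption)
            line = position[src]
            # Perfect shuffle: rotate the line number left by one bit.
            shuffled = ((line << 1) | (line >> (n_bits - 1))) & mask
            # Destination bits are consumed most significant bit first.
            bit = (requests[src] >> (n_bits - 1 - stage)) & 1
            out = (shuffled & ~1) | bit          # 0 = top output, 1 = bottom output
            if out not in taken:
                taken[out] = src                 # otherwise the message is blocked
        position = {src: out for out, src in taken.items()}
    # Every surviving message has arrived at its requested destination.
    assert all(position[src] == requests[src] for src in position)
    return set(position)


def average_delivered(n_bits, trials, seed=0):
    """Average number of messages delivered over random permutations."""
    rng = random.Random(seed)
    size = 1 << n_bits
    total = 0
    for _ in range(trials):
        dests = list(range(size))
        rng.shuffle(dests)
        total += len(route_omega(n_bits, dict(enumerate(dests))))
    return total / trials


# The identity permutation passes without any conflict.
print(len(route_omega(3, {i: i for i in range(8)})))   # 8
# Sources 0 and 4 collide in the first stage when both destinations start with 0.
print(route_omega(3, {0: 0, 4: 1}))                    # {0}
# Average over random permutations of 16 devices, to compare with N/2.
print(average_delivered(4, 1000))
```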

Then, we can call a CPU-memory pair a node, reduce the drawing
of a node to a dot, and show a few connection topologies
for multiprocessors.

"Ports" is the number of I/O ports the node must have.
"Max path" is the maximum number of hops a message must take
in order to get from one node to the farthest node. A message
may be as small as a Boolean signal or as large as a big

One measure of a multiprocessor's communication capability is
"bisection bandwidth". Split the processors into two equal groups,
choosing the split that leaves the least bandwidth between them,
and measure the maximum bandwidth that may be obtained between
the groups.
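
As a sketch, "ports", "max path" and the number of links crossing a
bisection can be computed for small example topologies. The ring and
hypercube builders below are example choices for this sketch and need
not match the topologies in the figure. The bisection routine takes the
worst-case split (fewest links between the halves) by brute force, so it
is only usable for a handful of nodes; multiplying the link count by a
per-link bandwidth to get bisection bandwidth is an assumption that all
links are equal.

```python
from collections import deque
from itertools import combinations


def ring(n):
    """n nodes in a ring: each node links to its two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def hypercube(d):
    """2**d nodes: node i links to every node differing in one bit."""
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(1 << d)}


def ports(graph):
    """Largest number of links on any one node."""
    return max(len(neighbors) for neighbors in graph.values())


def max_path(graph):
    """Largest hop count between any two nodes (breadth-first search)."""
    worst = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(dist.values()))
    return worst


def bisection_links(graph):
    """Fewest links cut by any split into two equal halves (brute force)."""
    nodes = sorted(graph)
    best = None
    for group in combinations(nodes, len(nodes) // 2):
        inside = set(group)
        cut = sum(1 for a in inside for b in graph[a] if b not in inside)
        best = cut if best is None else min(best, cut)
    return best


print(ports(ring(16)), max_path(ring(16)))             # 2 8
print(ports(hypercube(4)), max_path(hypercube(4)))     # 4 4
print(bisection_links(ring(8)), bisection_links(hypercube(3)))   # 2 4
```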

There are three main problems with massively parallel multiprocessors:
software, software and software.

The operating systems are only marginally useful for multiprocessing, where
a single program is to be run on a single data set using all the nodes
and all the memory. Today, the OS is almost no help and the programmer
must plan and program each node and every data transfer between nodes.

The programming languages are of little help. Java threads and
Ada tasks are not guaranteed to run on individual processors.
POSIX threads are difficult to use and control.
MPI and PVM libraries allow the programmer to specifically allocate
tasks to nodes and control communication at the expense of significant
programming effort.

Only a small percentage of applications are in the class of
"embarrassingly parallel". Most applications require significant
design effort to obtain significant "speedup".

Yes, Amdahl's law applies to multiprocessors.
Given a multiprocessor with N nodes, the maximum speedup
to be expected compared to a single processor of the same type
as the node, is N. That would imply that 100% of the program
could be made parallel.

Given 32 processors, if 50% of the program can be made fully parallel,
25% of the program can use half the processors and the rest of the program
must run sequentially, what is the speedup over one sequential processor?

Time sequentially is 100%

                         50%   25%   25%
Time multiprocessing is  --- + --- + --- = 28.125%
                         32    16    1

           100%
speedup = ------- = 3.56
          28.125%

far from the theoretical maximum of 32!

Note: "fully parallel" means the speedup factor is the number of processors.
      "half the processors" in this case is 32/2 = 16.
      the remaining 25% is sequential, thus factor = 1

Given 32 processors, if 99% of the program can be fully parallel,
what is the speedup?

Time sequentially is 100%

                         99%   1%
Time multiprocessing is  --- + -- = 4.1%
                         32    1

          100%
speedup = ---- = 24.4
          4.1%

about 3/4 the theoretical maximum of 32!
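
The two calculations above can be checked with a short Python sketch.
The function name speedup and the list-of-pairs input format are made up
for this sketch; each pair is (fraction of the program, number of
processors that fraction can use).

```python
def speedup(parts):
    """Speedup over one processor for a program split into parts.

    parts is a list of (fraction, processors) pairs. The fractions must
    sum to 1. A sequential part uses processors = 1.
    """
    total = sum(fraction for fraction, _ in parts)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # Time on the multiprocessor, as a fraction of the sequential time.
    parallel_time = sum(fraction / processors for fraction, processors in parts)
    return 1.0 / parallel_time


# 50% on 32 processors, 25% on 16 processors, 25% sequential.
print(f"{speedup([(0.50, 32), (0.25, 16), (0.25, 1)]):.2f}")   # 3.56
# 99% on 32 processors, 1% sequential.
print(f"{speedup([(0.99, 32), (0.01, 1)]):.2f}")               # 24.43
```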

These easy calculations consider only processing time.
In many programs there is significant communication time to
get the data to the required node and get the results to
the required node. A few programs may require more communication
time than computation time.

Consider a 1024 = 2^10 node multiprocessor.
Add 1,048,576 = 2^20 numbers as fast as possible on this multiprocessor.
Assume no communication cost (very unreasonable).
   step  action
      1  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
      2  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    ...
2^9=512  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums

         (so far fully parallel, now have only 2^19 numbers to add)

2^9+1    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
2^9+2    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    ...
2^9+2^8  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums

         (so far fully parallel, now have only 2^18 numbers to add)

see the progression:
2^9 + 2^8 + 2^7 + ... + 2^2 + 2^1 + 2^0 = 1023 time steps
         and we now have 2^10 partial sums, thus only 2^9 or 512
         processors can be used on the next step

1024    add 2^9 numbers to 2^9 numbers getting 2^9 partial sums
        (using 1/2 the processors)
1025    add 2^8 numbers to 2^8 numbers getting 2^8 partial sums
        (using 1/4 the processors)
 ...
1033    add 2^0=1 number to 2^0=1 number to get the final sum
        (using 1 processor)

                     sequential time   1,048,575
Thus our speedup is  --------------- = ----------- = 1015
                     parallel time       1033

The percent utilization is 1015/1024 * 100% = 99.12%
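
The step count can be checked with a small Python sketch that follows
the same schedule: on each step every available processor adds one pair
of numbers, and communication is free. The function name
parallel_sum_steps is made up for this sketch.

```python
def parallel_sum_steps(count, processors):
    """Time steps needed to add `count` numbers on `processors` nodes.

    Each step, each processor adds one pair of numbers, so one addition
    removes one number from the pool. At most count // 2 pairs exist.
    """
    steps = 0
    while count > 1:
        additions = min(processors, count // 2)
        count -= additions
        steps += 1
    return steps


numbers = 2 ** 20
nodes = 2 ** 10
steps = parallel_sum_steps(numbers, nodes)
sequential = numbers - 1             # additions needed on one processor
print(steps)                         # 1033
print(f"speedup = {sequential / steps:.0f}")   # speedup = 1015
```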

Remember: Every program has a last, single, instruction to execute.

Jack Dongarra, an expert in the field of multiprocessor programming,
says "It just gets worse as you add more processors."

Top 500 multiprocessors www.top500.org/lists/2006/06
