ARC INTRODUCTION TO MPI

Georgi Yanakiev (yanakiev@arc.unm.edu)
Mark Enlow (menlow@arc.unm.edu)
Chuanyi Ding (ding@arc.unm.edu)
Jim Warsa (warsa@boltzmann.unm.edu)

© Copyright 1995, The Maui Project/University of New Mexico
_________________________________________________________________

WHAT IS MPI?

   * Message passing is the communication model used on massively
     parallel machines with distributed-memory architectures (it can
     also be used on shared-memory machines)
   * The goal of the standard is to establish a widely used, language-
     and platform-independent standard for writing message-passing
     programs
   * The interface should establish a practical, portable, efficient,
     and flexible standard that is not too different from current
     practice

MPI STANDARDIZATION EFFORT

   * About 60 people from 40 organizations in the United States and
     Europe participated
   * Influenced by:
        + Venus (IBM)
        + NX/2 (Intel)
        + Express (Parasoft)
        + Vertex (nCUBE)
        + P4 (ANL)
        + PARMACS (ANL)
   * Other contributions:
        + Zipcode (MSU)
        + Chimp (Edinburgh University)
        + PVM (ORNL, UTK, Emory U.)
        + Chameleon (ANL)
        + PICL (ANL)

MILESTONES OF THE MPI FORUM

   * April 1992: Williamsburg, VA, Workshop on Message Passing Standards
   * November 1992: MPI draft proposal (MPI1) from ORNL
   * November 1992: Minneapolis working group meeting and e-mail
     discussion
   * November 1993: Draft MPI standard presented at Supercomputing '93
   * May 5, 1994: Current version
   * Meetings and e-mail discussion groups constitute the MPI Forum
_________________________________________________________________

FURTHER INFORMATION

   * "MPI: A Message-Passing Interface Standard" document available at
     ARC in /usr/local/ftp/pub/mpi-report.ps (comments@cs.utk.edu)
   * Usenet newsgroup comp.parallel.mpi
   * WWW sites
        + UTK/ORNL     http://www.netlib.org/mpi/index.html
        + ANL          http://www.mcs.anl.gov/mpi
        + MSU          http://www.erc.msstate.edu/mpi
        + LAM Project  http://www.osc.edu/lam.html
_________________________________________________________________

TARGET IMPLEMENTATION

   * MIMD parallel computation models, including SPMD
   * Distributed-memory clusters and multi-processors
   * Shared-memory platform support
   * Support for virtual process topologies
   * No dynamic task spawning (a fixed number of available processes)
   * Initial processor allocation, binding to physical processors, and
     interprocessor hardware communication are left to vendor
     implementations
   * Explicit shared-memory operations, I/O functions, and task
     management are not specified in the standard

LANGUAGE BINDING

   * Fortran 77 and ANSI C
   * Fortran 90 is assumed to follow the Fortran 77 binding and C++ the
     ANSI C binding
   * Standards for message passing between languages are not specified
   * MPI_ is prefixed to all MPI names (functions, subroutines, and
     constants)
_________________________________________________________________

POINT-TO-POINT COMMUNICATION

   * Message sending example

        MPI_SEND (mesg, len(mesg), MPI_CHAR, 1, 99, MPI_COMM_WORLD, ierror)

   * Message receiving example

        MPI_RECV (mesg, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, status, ierror)

   * Message envelope: source, destination, tag, communicator
   * Blocking and non-blocking modes available
   * Standard, buffered, ready, and synchronous modes available
   * Data type matching at all three stages of message-passing
     communication
   * Data type and representation conversion to support message passing
     in a heterogeneous environment
   * Communication semantics (order, progress, fairness, and resource
     limitations) are specified by the standard
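As a concrete illustration of these calls, the following minimal sketch
uses the ANSI C binding to send a character string from process 0 to
process 1 with tag 99, mirroring the MPI_SEND/MPI_RECV calls above. It
is not part of the workshop files; the message text and the 20-character
buffer are arbitrary choices, and it should be run with at least two
processes.

   /* Minimal sketch of blocking point-to-point communication using the
      ANSI C binding.  Process 0 sends a character string with tag 99
      to process 1. */

   #include <stdio.h>
   #include <string.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank;
       char mesg[20];
       MPI_Status status;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       if (rank == 0) {
           strcpy(mesg, "Hello, process 1");
           /* blocking send: buffer, count, type, dest, tag, communicator */
           MPI_Send(mesg, strlen(mesg) + 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
       } else if (rank == 1) {
           /* blocking receive from source 0, tag 99; the message
              envelope is returned in status */
           MPI_Recv(mesg, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
           printf("Process 1 received: %s\n", mesg);
       }

       MPI_Finalize();
       return 0;
   }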
NON-BLOCKING POINT-TO-POINT COMMUNICATION

   * Performance improvement by overlapping communication and
     computation
   * Communication initiation (the prefix I is for "immediate")

        MPI_ISEND (buf, count, datatype, dest, tag, comm, request, ierror)
        MPI_IRECV (buf, count, datatype, source, tag, comm, request, ierror)

   * Communication completion

        MPI_WAIT (request, status, ierror)
        MPI_TEST (request, flag, status, ierror)
        MPI_REQUEST_FREE (request, ierror)

   * Multiple-completion wait/test calls are also available
   * Incoming messages can be checked without actually receiving them:
     flag returns true if a match occurs and status returns the message
     envelope

        MPI_IPROBE (source, tag, comm, flag, status, ierror)

   * Pending communication requests can be cancelled to gracefully free
     resources

        MPI_CANCEL (request, ierror)

   * Multiple probe and request modes are also available
_________________________________________________________________

COLLECTIVE COMMUNICATION

   * The standard specifies collective communication operations over
     groups of processes
   * Barrier synchronization across all group members

        MPI_BARRIER (comm, ierror)

   * Broadcast from one member of the group to all other members

        MPI_BCAST (buffer, count, datatype, root, comm, ierror)

   * Gather data from all group members to one process

        MPI_GATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount,
                    recvtype, root, comm, ierror)

   * Scatter data from one group member to all other members

        MPI_SCATTER (sendbuf, sendcount, sendtype, recvbuf, recvcount,
                     recvtype, root, comm, ierror)

   * Global reduction operations such as max, min, sum, and product, as
     well as min- and max-location operations, are also available
_________________________________________________________________

MPI ENVIRONMENTAL AND PROCESS MANAGEMENT

   * Startup and shutdown

        MPI_INIT (ierror)
        MPI_FINALIZE (ierror)
        MPI_INITIALIZED (flag, ierror)
        MPI_ABORT (comm, errorcode, ierror)

   * Process management

        MPI_COMM_RANK (comm, rank, ierror)
        MPI_COMM_SIZE (comm, size, ierror)

   * To use all available processes in a single process group without
     further management, use MPI_COMM_WORLD as the communicator
   * Process group communicators, virtual process topologies, and
     communication context functions are provided to manage
     communication "universes" between and among process groups
   * Environmental inquiry functions, program execution timers, and
     error handling routines are also specified in the standard
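Putting the pieces together, a complete (if trivial) MPI program needs
little more than the startup, process-management, and shutdown calls
above. The following ANSI C sketch is illustrative only and is not one
of the workshop example files; the broadcast of a single integer from
process 0 is included simply to show a collective call in context, and
the value 100 is arbitrary.

   /* Minimal sketch of MPI startup, process management, a broadcast,
      and shutdown, using the ANSI C binding. */

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank, size;
       int n = 0;                          /* value to be broadcast */

       MPI_Init(&argc, &argv);             /* start up MPI */
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       if (rank == 0)
           n = 100;                        /* root supplies the value */

       /* every process in MPI_COMM_WORLD receives n from process 0 */
       MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

       printf("Process %d of %d received n = %d\n", rank, size, n);

       MPI_Finalize();                     /* shut down MPI */
       return 0;
   }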
_________________________________________________________________

Running MPI Programs at ARC

MPI IMPLEMENTATIONS AT ARC

  1. MPICH (MPI/Chameleon) - Argonne National Laboratory and
     Mississippi State University
  2. LAM (Local Area Multicomputer) - Ohio Supercomputer Center
  3. CHIMP (Common High-level Interface to Message Passing) - Edinburgh
     Parallel Computing Centre
_________________________________________________________________

EXAMPLES

Initial Example

The initial examples use mpich, the Argonne MPI implementation.

   * Add your username to your .rhosts file if your login is the same
     on all machines
   * Create a working directory

        > mkdir MPI
        > cd MPI

   * Obtain the example files

        > cp /usr/local/mpi/workshop/examples/* .

   * Your working directory should now contain the files:

        Makefile  test1.F  test2.F  test3.F

   * In order to run each program you have to create, in the working
     directory, a file called filename.pg (e.g., test1.pg). This
     processor group file should contain:

        local    0
        machineA 1 /u/username/MPI/test1
        machineB 1 /u/username/MPI/test1

     Create test2.pg and test3.pg in the same way.
   * A processor group file like the one above is required in order to
     start execution on the local machine and two remote machines.
     Executing the program test1.F on the local machine starts 3
     processes: the "master" process on the local machine and one on
     each of machineA and machineB.
   * A sample processor group file (e.g., test1.pg) for running on ARC
     is the following:

        local 0
        taos  1 /u/username/MPI/test1
        zia   1 /u/username/MPI/test1

   * Therefore, if you are logged onto acoma, the "master" process will
     run on acoma (the local machine) and two more processes will run
     on taos and zia.
   * To compile a program (e.g., test1.F), type:

        > make test1

   * To run the program, type:

        > test1

   * By modifying the corresponding processor group file you may run
     the program on 1, 2, or 3 processors.
   * When running the programs, messages on the screen refer to
     'Process 0' (or 1, or 2). The rank of each process corresponds to
     the order in which the machines appear in the processor group
     file. For the file test1.pg shown earlier:

        acoma is Process 0
        taos  is Process 1
        zia   is Process 2

File: test1.F

A simple exchange of messages from one process to another. One process
sends two messages (two arrays), which are received in the reverse
order by the other process.

Files: test2.F and test3.F

Calculation of the definite integral of a function using the trapezoid
rule. The range of integration is divided into n sub-intervals, and
each process calculates the integral over its own share of the
sub-intervals. Each process then sends its result to a root process,
where the final result is accumulated. Program test2.F uses
point-to-point communication and program test3.F uses collective
communication.
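The actual test2.F and test3.F sources are among the workshop files
copied above. Purely as an illustration of the same idea, a collective
(test3-style) version of the computation might look like the following
C sketch; the integrand f(x) = x*x, the interval [0,1], and the choice
of 1024 sub-intervals are arbitrary assumptions made for this example.

   /* Illustrative sketch of trapezoid-rule integration with a
      collective reduction (test3-style); not the actual test3.F. */

   #include <stdio.h>
   #include <mpi.h>

   static double f(double x) { return x * x; }   /* integrand */

   int main(int argc, char **argv)
   {
       int rank, size, i;
       int n = 1024;                 /* total number of sub-intervals */
       double a = 0.0, b = 1.0;      /* range of integration */
       double h, local_sum = 0.0, total = 0.0;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);

       h = (b - a) / n;

       /* each process handles sub-intervals i = rank, rank+size, ... */
       for (i = rank; i < n; i += size)
           local_sum += 0.5 * h * (f(a + i * h) + f(a + (i + 1) * h));

       /* combine the partial results on the root process (rank 0) */
       MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                  MPI_COMM_WORLD);

       if (rank == 0)
           printf("Integral over [%g,%g] is approximately %g\n",
                  a, b, total);

       MPI_Finalize();
       return 0;
   }

A test2-style (point-to-point) version would replace the MPI_Reduce
call with an MPI_Send of local_sum from each process and matching
MPI_Recv calls in a loop on the root process.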
_________________________________________________________________

More Advanced Examples

(Not currently available, due to hardware problems.)

These examples demonstrate the heterogeneous makefile system that
quickly switches between the various MPI implementations and
automatically performs the make for each of the Unix machines
(RS/6000, SGI, HP, and SUN).

        Package Name   Developer                   Location of Package
        mpich          Argonne and MSU             /usr/local/mpich
        chimp          Edinburgh                   /usr/local/chimp
        lam            Ohio Supercomputer Center   /usr/local/lam52

Need for a Makefile System

   * Switching between 3 MPI implementations: the ability to switch
     between three different MPI implementations allows the developer
     to check portability/compatibility, helps with debugging, and
     allows the selection of the optimum implementation for an
     individual application and environment
   * Heterogeneous cluster of workstations: a single make invocation
     compiles code for all Unix architectures, greatly simplifying the
     use of a heterogeneous cluster of workstations.

HETEROGENEOUS MAKEFILE

   # Architecture list
   ALL: aix_build sgi_build

   # Default architecture (needed to prevent include error)
   ARCH = rs6000

   # The following settings switch between the MPI implementations;
   # uncomment exactly one MPI/COMM pair.
   # Settings for mpich using p4
   #MPI  = mpich
   #COMM = ch_p4
   # Settings for chimp
   #MPI  = chimp
   #COMM =
   # Settings for lam
   MPI  = lam
   COMM =
   # End settings

   # Working directory -- set to the source code directory
   WORK_DIR = mpi

   RM = rm

   # All architecture and MPI implementation differences
   # are handled with the following include
   include ../$(ARCH).$(MPI).$(COMM).make

   programs: $(ARCH)/test1 $(ARCH)/first

   # Recursive make for each architecture
   aix_build:
           rsh acoma "cd $(WORK_DIR); make -e ARCH=rs6000 programs"
   sgi_build:
           rsh cochiti "cd $(WORK_DIR); make -e ARCH=sgi programs"

   # Instructions to create each object file and executable
   $(ARCH)/first: $(ARCH)/first.o Makefile
           cd $(ARCH); $(CC) -o first first.o $(LOPT)
   $(ARCH)/first.o: first.c Makefile
           cd $(ARCH); $(CC) -c $(COPT) ../first.c
   $(ARCH)/test1: $(ARCH)/test1.o Makefile
           cd $(ARCH); $(FC) -o test1 test1.o $(FLOPT)
   $(ARCH)/test1.o: test1.F Makefile
           cd $(ARCH); $(FC) -c $(FCOPT) ../test1.F

   # Housekeeping -- please use
   clean:
           $(RM) -f rs6000/*
           $(RM) -f sgi/*

MAKEFILE SYSTEM DEMONSTRATION

  1. Create a working directory called mpi under your home directory
  2. Copy all files from /usr/local/mpi to this directory
  3. Correct your .cshrc:
       a. run /usr/local/bin/reset_environment, OR
       b. compare with /usr/local/environment/.cshrc and correct
          manually
  4. Edit the Makefile and uncomment the two lines for the desired
     implementation
  5. Type make to create the executables
  6. Edit the machine configuration files (note: these contain
     user-dependent paths!)
       a. mpich --> *first.pg
       b. chimp --> *first.chp
       c. lam   --> *first.bhost, lamboot.conf
  7. Run the executables
       a. mpich: rs6000/first -p4pg ../first.pg
       b. chimp: chimp first.chp
       c. lam:   lam52_mpi first.bhost

ADDITIONAL EXAMPLES

   /usr/local/mpich/examples
   /usr/local/chimp/examples
   /usr/local/lam52/examples
_________________________________________________________________

© Copyright, The Maui Project/University of New Mexico
Last revised on March 1, 1995 by Georgi Yanakiev, yanakiev@arc.unm.edu