MPI-LITE User Manual
Release 1.1 (04/16/97)

UCLA Parallel Computing Lab

Contact:
Prof. Rajive Bagrodia
3531 Boelter Hall
Computer Science Department
University of California
Los Angeles, CA 90095
Tel: (310) 825-0956
Email: rajive@cs.ucla.edu

Copyright 1997 MPI-LITE

Authors: Rajive Bagrodia, Punit Bhargava and Sundeep Prakash
Computer Science Department
University of California, Los Angeles
Los Angeles, CA 90095

Permission to use, copy, modify, and redistribute this software and its documentation for research, educational, and non-profit purpose and without fee is hereby granted provided that the above copyright notice appears in all copies and that both the copyright notice and this permission notice appear in supporting documentation. Redistribution for profit is prohibited. Contact author if you wish to use this software and/or its documentation in a commercial product.

This software is provided "AS IS" and without expressed or implied warranty of any kind. Neither the University of California nor the Authors make any representations about the suitability of this software for any purpose.

MPI-LITE: A multithreading support for MPI

Introduction

MPI-LITE is a portable library to support multithreaded MPI programs. With the standard MPI distributions, each process is typically mapped to a unique processor; the only way to map multiple processes to a processor is by using multiple heavyweight processes (e.g., UNIX processes). MPI-LITE provides a kernel for creation, termination, and scheduling of user-level threads. The kernel is written entirely in ANSI C and can thus be easily ported to a variety of platforms and operating systems. A core set of the most commonly used MPI routines is supported by MPI-LITE and can be used by the threads for communication and synchronization. The functions currently supported by MPI-LITE are listed in Appendix B. Additional functions will be added as needed.
The routines for inter-thread communication are syntactically identical to those for inter-process communication except for a special suffix that distinguishes the two (a trailing double underscore, as in MPI_Init__, in the examples in this manual). A few minor modifications, listed in Appendix A, must be made to an MPI program for it to run with the MPI-LITE library.

The current release of MPI-LITE has been tested on the IBM SP2 running AIX version 4.1. We are collecting detailed performance measurements for MPI-LITE and comparing its performance with that of the native multithreading packages available on contemporary MPP platforms. A shared-memory implementation of MPI-LITE is also planned for the near future.

Thread Mapping & Scheduling

Each thread in MPI-LITE executes a copy of the given MPI program. The total number of threads is given as a command-line argument to the executable. The current version supports only an automatic block mapping scheme for allocating threads to processors: given T threads and N processors, each processor receives (T div N) or (T div N)+1 threads, and the (T mod N) processors with the lowest ids receive the additional thread. (Unique thread and processor ids are assigned using the appropriate MPI functions.)

A round-robin scheduler schedules the threads mapped to a processor. Each thread is in one of three states: Executing, Blocked, or Waiting. A thread is executing if it is currently running on the processor; at most one thread on a processor may be executing. A thread is blocked if it has executed a receive statement and its message buffer does not contain a 'matching' message to complete the receive operation; otherwise the thread is said to be waiting. The scheduler maintains the waiting threads in a separate queue called the wait-q. A thread moves from the executing to the blocked state autonomously.
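The block mapping rule described in this section can be sketched in C as follows. This is only an illustration under stated assumptions, not MPI-LITE's own code: the function names threads_on_proc and first_thread_on_proc are hypothetical, processors are assumed to be numbered from 0, and thread ids are assumed to be handed out contiguously in processor order, which the manual does not state.

```c
#include <assert.h>

/* Number of threads placed on processor `proc` when `nthreads` threads
   are block-mapped onto `nprocs` processors (hypothetical helper). */
int threads_on_proc(int nthreads, int nprocs, int proc)
{
    int base, extra;

    assert(nprocs > 0 && nthreads >= 0 && proc >= 0 && proc < nprocs);
    base  = nthreads / nprocs;   /* T div N */
    extra = nthreads % nprocs;   /* T mod N processors get one more thread */
    return base + (proc < extra ? 1 : 0);
}

/* Id of the first thread on processor `proc`. ASSUMPTION: thread ids are
   assigned contiguously, processor by processor, starting from 0. */
int first_thread_on_proc(int nthreads, int nprocs, int proc)
{
    int base, extra;

    assert(nprocs > 0 && nthreads >= 0 && proc >= 0 && proc < nprocs);
    base  = nthreads / nprocs;
    extra = nthreads % nprocs;
    /* Every lower-numbered processor holds `base` threads, and the first
       `extra` of them hold one additional thread each. */
    return proc * base + (proc < extra ? proc : extra);
}
```

With T = 10 and N = 4, the sketch gives processors 0 and 1 three threads each and processors 2 and 3 two threads each; under the contiguous-id assumption their first threads are 0, 3, 6 and 8.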
When the executing thread blocks, the scheduler inserts incoming messages into the message buffers of the threads, moves each blocked thread whose receive can now complete from blocked to waiting, and schedules the first thread in the wait-q for execution by setting its state to executing.

Example

The following program illustrates the basic message passing constructs in MPI. Processes communicate in a ring. No_of_processes stores the number of processes running the program; call this number p. Process 0 initiates the first message by sending it to process 1. On receiving a message from the process with id (i-1) mod p, the process with id i passes the message to the process with id (i+1) mod p. This is repeated MAX_MESSAGES times. The standard MPI version is given first, followed by the MPI-LITE version. In the MPI-LITE version, No_of_processes is the total number of threads, and the threads communicate in a ring.

MPI Version

    #include [...]
    #include "mpi.h"

    #define MAX_MESSAGES 100

    int main(int argc, char **argv)
    {
        int my_rank, source, dest, i;
        int No_of_processes;
        int tag = 50, message = 1, size_of_message = 1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);          /* id of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &No_of_processes);  /* size of the ring */

        if (my_rank == 0) {
            for (i = 0; i [...]

MPI-LITE Version

    #include [...]
    #include [...]
    #define SIM_EXTERN extern
    #include "myglobals.h"
    #undef SIM_EXTERN

    #define MAX_MESSAGES 100

    int SimTargetProg(void)
    {
        int my_rank, source, dest, i;
        int No_of_processes;
        int tag = 50, message = 1, size_of_message = 1;
        MPI_Status status;

        MPI_Init__();
        MPI_Comm_rank__(MPI_COMM_WORLD, &my_rank);          /* id of this thread */
        MPI_Comm_size__(MPI_COMM_WORLD, &No_of_processes);  /* total number of threads */

        if (my_rank == 0) {
            for (i = 0; i [...]

[...]

    #include [...]
    #include [...]
    #define N       32768
    #define Log_N   15
    #define cluster 4096
    #define Log_P   3

    double y[cluster][2];

    void reverse_data()
    {
        double temp[2];
        int i;

        for (i = 0; i [...]

[...]

    #include [...]
    #include "mpi.h"

    #define N 16

    int main(int argc, char **argv)
    {
        int my_rank, p, source, dest, tag = 50, i, j;
        MPI_Status status;
        float a[N], L[N], L_rec[N], sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (my_rank != 0) {
            /* Every process other than 0 receives my_rank+1 floats from process 0. */
            source = 0;
            MPI_Recv(a, my_rank + 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
        } else {
            for (dest = 1; dest [...]
        [...] i) {
            sum = 0.0;
            for (j = 0; j [...]

[...]

    #include [...]
    #include "mpi.h"
    #define SIM_EXTERN extern
    #include "myglobals.h"
    #undef SIM_EXTERN

    #define N 16

    int SimTargetProg(void)
    {
        int my_rank, p, source, dest, tag = 50, i, j;
        MPI_Status status;
        float a[N], L[N], L_rec[N], sum;

        MPI_Init__();
        MPI_Comm_rank__(&world, &my_rank);
        MPI_Comm_size__(&world, &p);

        if (my_rank != 0) {
            /* Every thread other than 0 receives my_rank+1 floats from thread 0. */
            source = 0;
            MPI_Recv__(a, my_rank + 1, MPI_FLOAT, source, tag, MPI_COMM_WORLD, &status);
        } else {
            for (dest = 1; dest [...]
        [...] i) {
            sum = 0.0;
            for (j = 0; j [...]