MPI-LITE User Manual

Release 1.1

 

 

(04/16/97)

 

 

 

 

UCLA Parallel Computing Lab

 

 

 

 

 

 

Contact: Prof. Rajive Bagrodia

3531 Boetler Hall

Computer Science Department

University of California

Los Angeles, CA 90095

 

 

 

Tel: (310) 825-0956

Email: rajive@cs.ucla.edu

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

© Copyright 1997

 

 

MPI-LITE

 

 

Authors: Rajive Bagrodia, Punit Bhargava and Sundeep Prakash

 

Computer Science Department

University of California, Los Angeles

Los Angeles, CA 90095

 

 

Permission to use, copy, modify, and redistribute this software and its documentation for research, educational, and non-profit purpose and without fee is hereby granted provided that the above copyright notice appears in all copies and that both the copyright notice and this permission notice appear in supporting documentation.

 

Redistribution for profit is prohibited. Contact author if you wish to use this software and/or its documentation in a commercial product.

 

This software is provided "AS IS" and without expressed or implied warranty of any kind. Neither the University of California nor the Authors make any representations about the suitability of this software for any purpose.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MPI-LITE : A multithreading support for MPI

 

Introduction

 

MPI-LITE is a portable library to support multithreaded MPI programs. With the standard MPI distributions, each process is typically mapped to a unique processor; the only way to map multiple processes to a processor is by using multiple heavy weight processes (e.g., UNIX processes). MPI-LITE provides a kernel for creation, termination, and scheduling of user level threads. The kernel is written entirely in ANSI C and can thus be easily ported to a variety of platforms and operating systems. A core set of the most commonly used MPI routines are supported by MPI-LITE and can be used by the threads for communication and synchronization. The functions currently supported by MPI-LITE are listed in Appendix B. Additional functions will be added as needed.

 

The routines for inter-thread communication are syntactically identical to those for inter-process communication except for the use of a special prefix to distinguish between the two. A few minor modifications, listed in Appendix A, are necessary to an MPI program for it to run with the MPI-LITE library.

 

The current release of MPI-LITE has been tested on the IBM SP2 running AIX version 4.1. We are in the process of collecting detailed performance measurements for MPI-LITE and comparing the performance with native multi-threading packages that are available on contemporary MPP platforms. A shared memory implementation of MPI-LITE is also planned for the near future.

 

Thread Mapping & Scheduling

 

Each thread in MPI-LITE executes a copy of the given MPI program. The total number of threads in the program are specified as inputs to the MPI-LITE program, i.e. given as command line arguments to the executable. In the current version, we only support an automatic block mapping scheme for allocating threads to processors: given T threads and N processors, each processor will have (T div N) or (T div N)+1 threads, with the (T mod N) processors with the lowest id receiving the additional thread. (Unique thread and processor ids are assigned using the appropriate MPI functions).

 

A round robin scheduler is used to schedule threads mapped to a processor. Each thread is in one of three states : Executing, Blocked or Waiting. A thread is in the executing stage if it is currently being executed on the processor; at most one thread on a processor may be executing. A thread is blocked if it has executed a receive statement and its message buffer does not contain a 'matching' message to complete the receive operation; otherwise

the thread is said to be waiting. The scheduler maintains the waiting threads in a separate queue - called the wait-q.

 

A thread moves from the executing to the blocked state autonomously. At this point, the scheduler inserts incoming messages into the message buffer of each thread, updates the state of a thread from blocked to waiting as needed, and schedules the first thread in the wait-q for execution, by setting its state to executing.

 

Example

The following Program illustrates the basic message passing constructs in MPI. Process communicate in a ring structure. No_of_processes stores the number of processes running the program. 0th process initiates the first message by sending it to 1. The process with id i on receiving a message from the process with id (i-1)%p passes the message to the process with id (i+1)%p. Such passing is done for MAX_MESSAGES number of times. First we give the standard MPI version followed by the MPI-LITE version. In MPI-LITE version No_of_processes is the total number of threads and the threads communicate in a ring structure.

MPI Version

#include <stdio.h>

#include "mpi.h"

#define MAX_MESSAGES 100

 

void main(int argc, char **argv){

int my_rank,source, dest, i;

int No_of_processes;

int tag = 50, message=1,size_of_message=1;

MPI_Status status;

 

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

MPI_Comm_size(MPI_COMM_WORLD, &p);

 

if ( my_rank == 0 ) {

for(i=0;i<MAX_MESSAGES;i++){

dest = 1; source = p-1;

printf("%d messages sent : ", message);

MPI_Send(&message, size_of_message, MPI_INT,dest,tag,MPI_COMM_WORLD);

MPI_Recv(&message, size_of_message, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

printf("received %d messages\n", message);

}

}

else

for(i=0;i<MAX_MESSAGES;i++){

source = my_rank - 1; dest = (my_rank + 1)%p;

MPI_Recv(&message, size_of_message, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

MPI_Send(&message, size_of_message, MPI_INT,dest,tag,MPI_COMM_WORLD);

}

MPI_Finalize();

}

 

MPI-LITE Version

#include <stdio.h>

#include <mpi.h>

#define SIM_EXTERN extern

#include "myglobals.h"

#undef SIM_EXTERN

#define MAX_MESSAGES 100

 

int SimTargetProg(void){

int my_rank,source, dest, i;

int No_of_processes;

int tag = 50, message=1,size_of_message=1;

MPI_Status status;

 

MPI_Init__();

MPI_Comm_rank__(MPI_COMM_WORLD, &my_rank);

MPI_Comm_size__(MPI_COMM_WORLD, &p);

 

if ( my_rank == 0 ) {

for(i=0;i<MAX_MESSAGES;i++){

dest = 1; source = p-1;

printf("%d messages sent : ", message);

MPI_Send__(&message, size_of_message, MPI_INT,dest,tag,MPI_COMM_WORLD);

MPI_Recv__(&message, size_of_message, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

printf("received %d messages\n", message);

}

}

else {

for(i=0;i<MAX_MESSAGES;i++){

source = my_rank - 1; dest = (my_rank + 1)%p;

MPI_Recv__(&message, size_of_message, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

MPI_Send__(&message, size_of_message, MPI_INT,dest,tag,MPI_COMM_WORLD);

}

}

MPI_Finalize__();

}

 

Variable Privatization

 

In an MPI program, each process has a separate copy of the permanent variables i.e. global and static variables within functions. If MPI-LITE executes the unmodified MPI program, all threads on the same host process will access a single copy of each permanent variable. To prevent this, it is necessary to 'privatize' the permanent variables such that each thread has a local copy. A preprocessor is provided with MPI-LITE for this purpose.

Each permanent variable is redeclared with an additional dimension whose size is equal to the maximum number of threads in a host process. And each reference to the permanent variable is modified appropriately by the parser, such that each thread uses its id to access its own copy of the permanent variable. This process of adding an additional dimension to the permanent variables is referred to as privatization. A preprocessor is provided with MPI-LITE to privatize global variables.

 

Example

 

The following program illustrates the use of global variables and its implication on MPI-LITE. One has to use a localizer that privatizes the variables. Following program calculates FFT of a given N numbers. To run the following program as an MPI-LITE program one should follow the guidelines given in Appendix A.

MPI Version

#include <math.h>

#include <mpi.h>

#define N 32768

#define Log_N 15

#define cluster 4096

#define Log_P 3

 

double y[cluster][2];

 

void reverse_data() {

double temp[2]; int i;

 

for (i=0;i<cluster;i++){

if (rev(i)<i){

temp[0]=y[i][0];

temp[1]=y[i][1];

y[i][0]=y[rev(i)][0];

y[i][1]=y[rev(i)][1];

y[rev(i)][0]=temp[0];

y[rev(i)][1]=temp[1];

}

}

}

void local_fft(int sign, int my_rank) {

int i,j,k,m; double w[2],w_m[2],u[2],t[2];

 

for (i=1;i<=Log_N-Log_P;i++){

m = power(2,i);w_m[0] = cos ((2 * M_PI)/m);w_m[1] = sign * sin ((2 * M_PI)/m);w[0]=1;w[1]=0;

for(j=0;j<=(m/2)-1;j++){

for(k=j;k<=cluster-1;k+=m){

t[0] = w[0]*y[k+m/2][0]-w[1]*y[k+m/2][1];

t[1] = w[0]*y[k+m/2][1]+w[1]*y[k+m/2][0];

u[0] = y[k][0]; u[1] = y[k][1];

y[k][0] = u[0] + t[0];

y[k][1] = u[1] + t[1];

y[k+m/2][0] = u[0] - t[0];

y[k+m/2][1] = u[1] - t[1];

}

w[0] = w[0]*w_m[0] - w[1]*w_m[1];w[1] = w[0]*w_m[1] + w[1]*w_m[0];

}

}

}

 

void main(int argc, char **argv){

int my_rank, p, source, dest, priority, tag=50, m, m_red, i, k, j, phase, my_pair, rank[Log_P], size ;

MPI_Status status;

double w[2],w_m[2],t[2],u[2],b[cluster][2];

MPI_Comm my_comm[Log_P];

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

MPI_Comm_size(MPI_COMM_WORLD, &p);

for (i=0;i<cluster;i++){

y[i][0]=1;y[i][1]=0;

}

reverse_data();

local_fft(1,my_rank);

/* Calculating the final FFT by communication and computation */

for(i=1+Log_N-Log_P;i<=Log_N;i++){

k=i-Log_N+Log_P-1; m = power(2,i);m_red = power(2,i-Log_N+Log_P);priority= my_rank % m_red;

if ( priority < m_red/2){

MPI_Send (y,2*cluster,MPI_DOUBLE,1-rank[k],tag,my_comm[k]);

MPI_Recv (b,2*cluster,MPI_DOUBLE,1-rank[k],tag,my_comm[k],&status);

for (j=0;j<cluster;j++){

y[j][0] = y[j][0] + b[j][0];

y[j][1] = y[j][1] + b[j][1];

}

}

else{

w_m[0] = cos ((2 * M_PI)/m);w_m[1] = sin ((2 * M_PI)/m);

w[0]=cos((2*M_PI*(priority-m_red/2))/m); w[1]=sin((2*M_PI*(priority-m_red/2))/m);

for (j=0;j<cluster;j++){

y[j][0] = y[j][0]*w[0]-y[j][1]*w[1];

y[j][1] = y[j][0]*w[1]+y[j][1]*w[0];

w[0] = w[0]*w_m[0] - w[1]*w_m[1]; w[1] = w[0]*w_m[1] + w[1]*w_m[0];

}

MPI_Recv (b,2*cluster,MPI_DOUBLE,1-rank[k],tag,my_comm[k],&status);

MPI_Send (y,2*cluster,MPI_DOUBLE,1-rank[k],tag,my_comm[k]);

for (j=0;j<cluster;j++){

y[j][0] = b[j][0] - y[j][0];

y[j][1] = b[j][1] - y[j][1];

}

}

}

/* Omitted the Code for calculating the Inverse FFT */

for (j=0;j<cluster;j++) /* Doing validity check.... */

if (y[j][0]!=N ){

printf("I did some mistake!! proc %d at %d with value %f\n", my_rank,j,y[j][0]);

break;

}

MPI_Finalize();

}

 

Communicator Implementation

 

In MPI, each process maintains a list of communicators that it is a member of, and has a message queue for each communicator. In MPI-LITE, each process maintains a list of communicators that any of its threads is a member of, and on each communicator, maintains a message queue for each local member thread.

 

Message Delivery

 

In MPI, messages are sent only between processes. In MPI-LITE, messages data and acknowldgement messages are exchanged between threads, which may be on the same or different processes. Consequently, MPI-LITE needs to support both interthread message delivery and interprocess message delivery. Interthread message delivery is obviously faster, and depending on the semantics of the send operation, is made even faster by passing only a pointer to the message body. For example, messages sent using MPI_Issend are passed as pointers, since the message body will not be reused by the sending thread until the receiver thread has accepted the message. This assumption is not

valid for messages sent using MPI_Ibsend however, so a copy of the message body is made and sent. Interprocess message delivery is performed using MPI.

 

Collective Communication Functions

 

All collective communication, including that used by the communicator manipulation functions, is internally implemented in most MPI implementations as a set of point to point communication operations. For example, an MPI_Bcast on a communicator happens in two stages: (a) Each process dynamically configures a tree, using the total number of processes in the communicator. Its own position in the tree is determined by its rank, and (b) Each process then waits for the broadcast message from its parent using a receive statement, and forwards the message to its children, using send statements. Assume a communicator with 5 processes. In a simple complete binary tree, (We use a binary tree only for simplicity in exposition. Most implementations use more efficient trees.), process 0 would be the root, processes 1 and 2 its children, processes 3 and 4 the children of process 1, and process 5 the child of process 2. The send and receive statements issued at each process should be obvious. A tree rooted at some process other than 0 is formed by creating the basic tree described above and swapping the root with desired root. The same tree construction algorithm is used for all collective communication calls, i.e. for barriers and reduces as well.

 

Consequently, all that is used to implement a collective communication call is a receive statement with exactly the same functionality as MPI_Recv, and a send statement with the same functionality as any of the MPI send statements. In order to prevent point-to-point receives from receiving messages sent in a collective communication operation, all that is used is a special tag for messages that are part of a collective communication operation.

 

 

Example

 

The following program evaluates the LU decomposition of a given matrix. It demonstrates the use of the collective calls in MPI, and their corresponding calls in MPI-LITE.

 

MPI Version

#include <stdio.h>

#include <math.h>

#include "mpi.h"

#define N 16

 

void main(int argc, char **argv){

int my_rank, p, source, dest, tag=50, i,j;

MPI_Status status;

float a[N],L[N],L_rec[N], sum;

 

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

MPI_Comm_size(MPI_COMM_WORLD, &p);

if ( my_rank!= 0){

source = 0;

MPI_Recv(a,my_rank+1,MPI_FLOAT,source,tag,MPI_COMM_WORLD,&status);

}

else{

for(dest=1;dest<p;dest++){

for(j=0;j<=dest;j++) a[j]=j+1;

MPI_Send(a,dest+1, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);

}

a[0]=1;

}

for(i=0;i<p;i++)

if ( my_rank== i){

sum= 0.0;

for(j=0;j<i;j++) sum += L[j]*L[j];

sum=a[i] - sum;

if ( sum < 0.0 ) exit;

L[i]=sqrt(sum);

MPI_Bcast(L,i+1,MPI_FLOAT,i,MPI_COMM_WORLD);

}

else{

MPI_Bcast(L_rec,i+1,MPI_FLOAT,i,MPI_COMM_WORLD);

if ( my_rank > i){

sum = 0.0;

for(j=0;j<i;j++) sum += L[j]*L_rec[j];

sum=a[i] - sum; L[i]=sum/L_rec[i];

}

}

/* Omitted Output the Matrix */

MPI_Finalize();

}

 

 

 

MPI-LITE Version

#include <stdio.h>

#include <math.h>

#include "mpi.h"

#define SIM_EXTERN extern

#include "myglobals.h"

#undef SIM_EXTERN

#define N 16

 

int SimTargetProg(void){

int my_rank, p, source, dest, tag=50, i,j;

MPI_Status status;

float a[N],L[N],L_rec[N], sum;

 

MPI_Init__();

MPI_Comm_rank__(&world, &my_rank);

MPI_Comm_size__(&world, &p);

if ( my_rank!= 0){

source = 0;

MPI_Recv__(a,my_rank+1,MPI_FLOAT,source,tag,MPI_COMM_WORLD,&status);

else{

for(dest=1;dest<p;dest++){

for(j=0;j<=dest;j++) a[j] = j+1;

MPI_Send__(a,dest+1,MPI_FLOAT, dest,tag,MPI_COMM_WORLD);

}

a[0]=1;

}

for(i=0;i<p;i++)

if ( my_rank== i){

sum= 0.0;

for(j=0;j<i;j++) sum += L[j]*L[j];

sum=a[i] - sum;

if ( sum < 0.0 ) exit;

L[i]=sqrt(sum);

MPI_Bcast(L,i+1,MPI_FLOAT,i,MPI_COMM_WORLD);

}

else{

MPI_Bcast(L_rec,i+1,MPI_FLOAT,i,MPI_COMM_WORLD);

if ( my_rank > i){

sum = 0.0;

for(j=0;j<i;j++) sum += L[j]*L_rec[j];

sum=a[i] - sum; L[i]=sum/L_rec[i];

}

}

/* Omitted Output of the Matrix. */

MPI_Finalize__(&err);

}

 

 

 

 

Appendix A

 

Follow the two steps to run any mpi program as an MPI-LITE program.

  1. Copy the mpi file into AutoLoc dirctory. In AutoLoc directory use the perl script automate.pl to generate target.c (MPI-LITE version of the program), copy target.c to Mpi_Lite directory:

e.g. automate.pl fft, where the MPI program is in fft.c

  1. Use the Makefile given with the source code of MPI-LITE (in Mpi_Lite directory) and call "make all" to get the executable in target.

 

Note: automate.pl uses the following perl script files :

add_std_headers.pl, changemain.pl, countstatics.pl, mpi_to_mpi-lite.pl, rem1.pl, rem2.pl, remove_std_headers.pl, splicelines.pl

 

Appendix B

 

Following functions are currently supported by MPI-LITE.

 

int MPI_Comm_dup__(int Comm, int *NewComm);

int MPI_Comm_split__(int Comm, int Color, int Key, int *NewComm);

int MPI_Reduce__(void *SendBuf, void *RecvBuf, int Count, int DataType, int ReduceOp, int Root,

int Comm);

int MPI_Allreduce__(void *SendBuf, void *RecvBuf, int Count, int DataType, int ReduceOp, int Comm);

int MPI_Barrier__(int Comm);

int MPI_Bcast__(void *Buf, int Count, int DataType, int Root, int Comm);

int MPI_Init__(void);

int MPI_Finalize__(void);

int MPI_Comm_rank__(int Comm, int *Rank);

int MPI_Comm_size__(int Comm, int *Size);

int MPI_Abort__(int Comm, int ErrorCode);

double MPI_Wtime__(void);

int MPI_Waitall__(int Count, int *Requests, int Statuses[][MPIF_STATUS_SIZE]);

int MPI_Wait__(int *Request, int *Status);

int MPI_Bsend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm);

int MPI_Ibsend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm, int *Request);

int MPI_Ssend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm);

int MPI_Issend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm, int *Request);

int MPI_Rsend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm);

int MPI_Irsend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm, int *Request);

int MPI_Send__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm);

int MPI_Isend__(void *Buf, int Count, int DataType, int Dest, int Tag, int Comm, int *Request);

int MPI_Recv__(void *Buf, int Count, int DataType, int Src, int Tag, int Comm, int *Status);

int MPI_Irecv __(void *Buf, int Count, int DataType, int Src, int Tag, int Comm, int *Status,

int *Request);