next up previous
Next: Introduction

Performance Prediction of Parallel Programs

Sundeep Prakash

PhD Dissertation
Computer Science Department
University of California
Los Angeles, CA

Abstract:

Two important paradigms for programming parallel computers are data parallelism and task parallelism. Data parallel programs have a single locus of control and task parallel programs have multiple locii of control. Multicomputers of today are essentially MIMD computers, and hence are more amenable to the task parallel programming paradigm. Consequently, a data parallel program must be translated to a task parallel program and then executed on multicomputers. In order to maintain a single locus of control, explicit global synchronizations must be inserted in the translated task parallel program. This results in performance degradation, since global synchronizations are very expensive. The first part of this dissertation presents techniques to reduce and replace such global synchronizations. In particular, an adaptive synchronization technique is presented which eliminates global synchronizations that cannot be eliminated by the widely known inspector-executor strategy. Simulation studies are presented which estimate the potential performance improvement due to this technique.

The work in the second and major part of this dissertation was motivated by the need for a fast parallel simulation engine of data parallel programs. Existing simulation engines are either sequential and slow, or parallel and not portable. We first present the design and implementation of DPSIM, a parallel simulation engine for data parallel programs written in UC. DPSIM exploits the deterministic nature of data parallel programs to optimize its underlying conservative simulation protocol. These optimizations are specific to the data parallel programming paradigm and hence not restricted to a particular architecture or programming language. Experimental results on the performance of DPSIM for a set of scientific applications including Gauss Jordan elimination and matrix multiplication show speedup factors as high as 11 when using 16 processors.

We then present the design and implementation of MPISIM, a simulation library for the parallel simulation of MPI programs. MPISIM demonstrates concepts that can be used in the simulation of any general message passing language. It uses a conservative simulation protocol that integrates the conditional event protocol and the null message protocol. The null message protocol provides fast progress when good lookaheads are available. Lookaheads are extracted dynamically, and may be extracted at compile time as well, using program analysis. When good lookaheads are not available, the conditional event protocol provides fast progress. Deterministic sections of code are detected at runtime, and the same optimizations used for DPSIM are automatically applied resulting in extremely fast simulation. We present results for the validation and performance of MPISIM for a subset of the NAS Parallel Benchmarks (NPB 2). As with DPSIM, excellent speedups are obtained on increasing the number of processors used by MPISIM.





next up previous
Next: Introduction



Andy Kahn
Wed Jun 25 20:28:02 PDT 1997