Abstract:
Efficient I/O implementations can have a significant impact on the performance of parallel applications. This paper describes the design and implementation of PIOSIM, a parallel simulation library for MPI-IO programs. The simulator can be used to predict the performance of existing MPI-IO programs as a function of architectural characteristics, caching algorithms, and alternative implementations of collective I/O operations. This paper describes the simulator and presents the results of a number of performance studies to evaluate the impact of the preceding factors on a set of MPI-IO benchmarks, including programs from the NAS benchmark suite.
Simulation models are commonly used to evaluate the performance of parallel architectures and predict the impact of algorithmic and architectural innovations on the performance of parallel applications. To reduce simulation times, most simulators typically use direct execution to simulate the local portions of their code and use models only for the communication and input/output events. Even with direct execution, sequential simulation of large parallel programs can be very time consuming [3][5][10]. This has lead to a variety of attempts to use parallel execution to reduce simulation times for models that simulate parallel programs [16][22][11]. Most of the existing parallel program simulators are used to evaluate the performance of the memory hierarchy, interconnection network, or processor architecture. To the best of our knowledge, none of the existing parallel simulators have been used to evaluate parallel I/O systems. Specific contributions of this paper include:
The rest of the paper is organized as follows: Parallel I/O and Parallel File Systems presents an overview of the subset of MPI-IO and the parallel file system that is simulated. Parallel I/O Simulator describes the architecture of the simulator and provides a brief desription of each of the primary components in the simulator. Experiments and Results describes the benchmarks and the experimental results. Conclusion and Related Work concludes and summarizes with related work.
Parallel file systems offer the high performance necessary for running comple parallel scientific applications. To provide maximum scalabilty, the machine's processors are generally divided into two separate groups, compute nodes (cnodes) and I/O nodes (ionodes). Cnodes run the user applications and send I/O requests to the ionodes. All disk access is performed by the ionodes, which each have their own secondary storage device (SSD) and manage a subset of the file system's total data. In this manner, the parallelism of the user application can be preserved when performing I/O. Some parallel file systems [7][6] even allow user processes to configure the underlying I/O parallelism to match their intended access pattern, further enhancing performance.
When properly tuned, the use of file system caching can drastically reduce the number of I/O requests which must wait to be serviced by the SSD. Unfortunately, choosing the correct cache configuration for peak performance is no mean feat. This is particularly true of parallel file systems, which offer a wide variety of techniques for managing the many caches spread across the ionodes and cnodes. One technique with much potential is cooperative caching. Presented in [12] as a set of high performance caching algorithms for use within a network file system, cooperative caching attempts to improve performance through better management of multiple client and server caches. This basically includes the ability to retrieve data from remote client caches and the global control of a portion of the client caches. With the low latency provided by the interconnection network of a parallel machine, cooperative caching may be even better suited for use within a parallel file system.
For users requiring maximum I/O performance, the pairing of a high performance parallel file system with a language-level parallel I/O library, such as MPI-IO, is a natural choice. MPI is a standard message passing library that provides a number of point-to-point and collective communication primitives [13]. MPI-IO is a proposed extension to MPI that would incorporate parallel I/O constructs [17]. The proposed constructs include both independent and collective I/O operations; asynchronous I/O calls; file access via independent file pointers, shared file pointers, and explicit offsets; and local and distributed datatype constructors,
The goal of our simulator is to provide a flexible environment for studying the performance, interaction, and design tradeoffs inherent with the use of high performance parallel file systems and parallel I/O libraries. In addition, support for parallel execution of the simulator will considerably reduce the turnaround time for the execution of detailed simulation models for complex system configurations and workloads.
Henceforth, we use the term Target Program to refer to the MPI-IO program whose performance is to be predicted and Target Machine to refer to the machine on which the target program executes. The term Simulator refers to the program that simulates execution of the target program and Host Machine refers to the sequential or parallel architecture on which the simulator executes. The simulator contains the following primary components:
Since we restrict our attention in to parallel I/O simulations, the interested reader is referred to [21] for information on MPISIM. The I/O simulator has been designed to be both modular and extensible: it is relatively easy to replace individual modules at each of the preceding levels. In particular, it is straightforward to replace the SSD models, to modify the caching or partitioning policies used by PFS-SIM, to modify the implementation of a specific collective MPI-IO operation supported by PIO-SIM, and to change the model of the interconnection network.
We assume that the target program includes local code blocks that are simulated by direct execution, MPI communication calls that are simulated by MPISIM, and MPI-IO commands that are handled by PIO-SIM. Each process of the target program is modeled by a single thread in the simulator; we refer to this thread as the target LP. When a target LP executes an MPI-IO call, it is intercepted by PIO-SIM. In the case of collective I/O, MPISIM's underlying communication facility is used for synchronization and communication between target LP's. If complex user-defined datatypes are used, which allow processes to access non-contiguous pieces of data with a single MPI-IO call, the single non-contiguous request is decomposed into multiple contiguous requests. PIO-SIM uses standard UNIX system I/O calls (e.g., read(), write(), etc.) to replicate the functionality of these operations in the simulator. These I/O requests are then passed to PFS-SIM, in order to determine I/O execution time for the simulated subsystem.
The I/O subsystem can be simulated at multiple levels of detail. At the most abstract level, a simple analytical model is provided which calculates the I/O time as a function of specified disk performance characteristics and the size of the data transfer. For more detailed analysis, PFS-SIM is used to simulate the parallel file system's cnodes, ionodes and disks, using seperate LPs to represent each of these entities. The I/O requests from PIO-SIM are passed to the cnode LP corresponding to the target machine's cnode on which the requesting target process would be running. Each request is then distributed to one or more ionode LPs based upon the physical data layout selected. Similarly, ionode LPs send their requests to disk LPs, where I/O service times are calculated. This path is then reversed, completing the I/O simulation. When caching is included in the simulation model, this path may of course be shortened or may include additional communication among ionode LPs and cnode LPs (e.g., block invalidation requests or cooperative caching messages).
PIO-SIM is the layer of the simulator which simulates MPI-IO calls. PIO-SIM implements all data access operations except calls using a shared file pointer. This includes positioning using explicit offsets and individual file pointers, independent and collective coordination, and blocking and nonblocking synchronism.
Two central problems in the simulation of MPI-IO programs include support for user defined datatypes and specific implementations of collective I/O operations. We address them in the following sections.
In MPI-IO, MPI datatypes are used to specify a "view" of the file. Effectively, this allows processes to execute I/O operations with different file access patterns. For example, if four processes were to access a common file which contained a matrix stored in row-major order, each process could open a filetype that represented only a submatrix of the bigger matrix (e.g., a specific row or column). Data which lies outside of this filetype would not be accessible to this process. Subsequent read and write operations to the file would pertain only to the section of the matrix that lie within the filetype used to open the file.
Since the simulator was designed to be portable and not rely on any specific implementation of MPI (e.g., internal data structures of the MPI runtime system), the attributes of each datatype are extracted by PIO-SIM when a call is made to a datatype constructor. This information is then stored within the simulator so that subsequent calls which use this datatype can refer to it as needed. Consequently, PIO-SIM maintains its own internal structure to represent datatypes. This structure consists of a list of base offsets and the length (number of bytes) of valid data at that offset. This structure is similar to I/O lists, as found in PMPIO, NAS's implementation of MPI-IO [20], as well as type maps, as found in the MPI-IO implementation for the NEC Cenju-3 [24]. We will refer to this structure simply as an IOL.
In MPI-IO, opening a file is a collective operation. Upon invocation, each target LP creates an IOL to represent its filetype (ftype IOL). Afterwards, each target LP exchanges its ftype IOL with all other target LP's. This is done to facilitate using global knowledge about any subsequent collective read/write operation for optimized collective I/O.
Finally, during a read or write call, the target process specifies a data buffer as well as its corresponding datatype (btype). A btype IOL is constructed at each call so that the data from the user buffer can be mapped to the correct locations within the file according to the ftype IOL. The mapping requires each target LP to create a request IOL that represents the entire I/O request. This results in using multiple (as well as partial) copies of the ftype IOL to "tile" the entire request. Subsequently, the I/O operation is performed by the target LP by traversing the request IOL, seeking to the base offset and reading/writing the number of bytes as specified in each element of the request IOL.
In the case of collective I/O operations, the underlying implementation of the collective I/O algorithm is hidden from the user. The basic idea behind collective I/O algorithms is to use global knowledge about the I/O transfer among the processes in order to combine many, small requests into fewer, large requests, reducing disk access overhead. PIO-SIM currently supports the following collective I/O implementations:
The basic structure and functionality of PFS-SIM is taken from the Vesta parallel file system, a highly scalable, experimental file system developed by IBM [6]. Many of Vesta's features have been included in the design of PFS-SIM, most notably, the use of an interface which allows user applications to configure the parallelism actually used to perform I/O.
In addition to the flexibility contained within the Vesta interface, PFS-SIM allows many of the file system's physical characteristics to be varied, such as cnode/ionode ratio, number of disk drives attached to each ionode, and disk drive characteristics, as well as a multitude of different cache configurations.
While the Vesta file system implemented caching only at the ionodes, PFS-SIM supports systems which have caching at both ionodes and cnodes. This caching setup offers a larger variety of configurations for study, including cooperative caching. PFS-SIM also supports a full range of cache sizes, cache block sizes, cache associativities (direct-mapped, fully associative, set associative) and write policies (write-through/write-back, write-allocate/write-around). Write-invalidation is used to maintain cache coherency, though write-update could easily be added. Block replacement uses the LRU algorithm.
The cache management policies implemented by PFS-SIM are:
The IBM-SP2 at UCLA was selected both as the target machine as well as the host machine (this is desirable when using direct execution to measure the performance of local code). The workload for our experiments was provided by using three MPI/MPI-IO programs plus a synthetic benchmark. The first two programs are from the NAS BTIO Parallel I/O Benchmarks v0.1. 3 The BTIO benchmark extends the original BT benchmark [4] by using MPI-IO to write data to a file at regular time intervals. The first BTIO program is the "simple" benchmark, where only non-collective I/O and primitive MPI datatypes are used. The second is the "full" benchmark, where data is fully described using MPI non-contiguous, user-defined datatypes, and collective I/O is used. Both benchmarks exhibit numerous seek operations. The third benchmark we have used is a basic out-of-core matrix multiplication application. Lastly, the synthetic benchmark generates random read/write requests with the starting address, request size, and read/write ratio generated by stochastic functions.
Figure 1 shows the simulator speedup from parallel simulation the NAS BTIO Simple and Full Benchmarks using the simple disk models. For the full benchmark, all three implementations of collective I/O are shown: Global Barrier (GB), Node Grouping (NG), and Two-Phase I/O (2P). Note that these results only address the issue of simulation performance and of the underlying collective I/O implementation.
Each graph shows the NAS Benchmark with three different configuration sizes: 4-processor problem, 9-processor problem, and 16-processor problem. Each configuration is run through a simulation with a varying number of processors. Note that there is considerably more speedup in the NAS Full Benchmark when using Two-Phase I/O than there was when using Global Barrier and Node Grouping. This occurred because the computation granularity is much greater in Two-Phase I/O, predominantly from the work required in the permutation phase, where the numerous small requests are merged into fewer, larger requests.
Also using a simple disk model, Figure 2 shows a comparison of the execution time as predicted by the simulator for different collective I/O implementations. Although node grouping has previously been shown to yield upto 8 times improvement [18], it has performed very poorly for the NAS Benchmark. This is mostly due to the fact that this particular problem size of the NAS Benchmark does not generate enough I/O requests to flood the interconnection network or cause any resource contention. Thus, by not allowing all the LP's to make their I/O requests simultaneously as in Global Barrier Collective I/O, we are simply delaying the execution.
The predicted time for Two-Phase I/O is clearly the best among the three implementations. This can be attributed to its ability to take the non-contiguous datatypes used by the BTIO Full Benchmark and merge them into larger, contiguous requests. Table 1 depicts the actual number of total seek and write requests made by the benchmark using Global Barrier and Two-Phase I/O. Node Grouping is not listed in the table, since it generates the same number of requests as Global Barrier.
The main components of disk access time are the time required to move the disk head to the correct track (seek time), the time spent waiting for the correct disk sectors to rotate beneath the head (rotational latency), and the time needed to move the data to/from the disk (data transfer time). Seek time and rotational latency are overheads which must be paid for each disk access and may constitute a significant portion of the I/O service time, particularly for small requests. By merging many small requests into a few large requests, overheads are reduced significantly. Also, the minimum amount of data which may be transferred to/from a disk in a single operation is a disk sector (typically 512 bytes). Requests which are smaller than this will result in more data being read than is necessary. However, when a number of these small requests access the same sector, they may be grouped together, reducing (or even eliminating) the amount of excess data which is read.
For initial testing of each of the cache management policies, we used the synthetic benchmark to generate the workload. The workload consisted of 5000 warm-up requests and 5000 measured requests, with each request generally consisting of between one and four data blocks. The percentage of read requests was set to eighty percent.
Figure 3 shows the execution time of this synthetic benchmark as the number of ionodes in the system is varied from 1 to 16 (the number of cnodes is held constant, either 4 or 8):
Figure 4 shows the number of data blocks which had to be actually read from the disk for the same workload and file system configurations:
Reads were chosen since this is where the major difference in performance of the various cache management policies can be seen. To eliminate any other variations due to a difference in the handling of write requests, the experiments were run using write-through and write-around, for which all cache policies behave the same. For each set of graphs, two scenarios were examined. First, the size of the cache at each ionode was fixed at 128 blocks. Second, the total cache size was fixed at 512 blocks and was divided equally among the ionodes. In this case, as the number of ionodes increased, the size of the cache at each ionode decreased. This allowed the benefit of having more disks available to be seperated from the benefit of having a larger aggregate cache size. These results clearly show that caching performance continued to improve as the level of cache cooperation was increased and refined.
Figure 5 shows the execution time and the number of disk reads for the matrix multiplication program, as the size of the matrices is increased, for each of the cache management policies.
The same file system configuration consisting of four cnodes and one ionode was used for all cases. The size of the matrices was carefully chosen to gradually become larger than the capacity of the ionode cache and a single cnode cache. It was expected that this would show the fundamental difference in the way each caching scheme operates.
When the dataset fits comfortably into the ionode cache and a single cnode cache, the base caching scheme performs nearly as well as as the cooperative caching algorithms. But as the dataset becomes larger, both base caching and greedy forwarding become ineffective. This is due to the repeating, sequential access pattern of the matrix multiplication program. By the time the last columns of the second matrix are read, the first columns, which must be re-read for multiplication with the next row, have been evicted from the cache. Centrally coordinated caching, however, performs extremely well. By eliminating data redundancy for a large portion of the file system's cache, all necessary data remains in the cache, resulting in compulsory misses only. Of course, once the data becomes too large to remain in the ionode cache and the centrally coordinated cache, performance becomes as poor as for the other caching schemes.
The poor performance of the globally managed caching scheme was quite unexpected, especially since it clearly outperformed all the other caching policies for the synthetic benchmark. The simulator was used to closely examine the program's actual data access pattern and the resulting cache behavior. It turned out that the program's excessive inter-process data sharing coupled with the sequentiality of using a single ionode works poorly with the globally managed caching's data placement policy (using the ionode cache to store evicted cnode cache blocks). Data tended to clump together in the cache of the last cnode to access the shared data, while the caches of the other cnodes remained relatively empty. Once this unfortunate cnode's cache became full, it would send evicted blocks to the ionode cache. Once the ionode cache became full, evicted blocks were discarded, even though many cnodes had space available within their caches. Solutions to this shortcoming are currently being investigated.
The matrix multiplication program was chosen for the study of the interaction between the cooperative caching and the collect I/O techniques since it benefits from both approaches. In our implementation of matrix multiplication, the majority of I/O operations involve reading the columns of the second matrix. Not only does reading a single column involve numerous I/O operations (since the matrices are stored on the disks in row-major order), but every column must be read in once for each row of the first matrix. Thus, for two NxN matices, reading in each column requires N operations and all N columns must be read in N times (once for each row of the first matrix), resulting in N3 operations directed to the file containing the second matrix. The file with the first matrix is hit with only N operations since each row is read in only once and may be read with a sinle operation since row elements are stored contiguously on the disk (except when a row happens to cross a disk block boundary).
Two different access patterns were used for reading the columns of the second matrix. In the original pattern, each target process began reading at a different column. This staggerring was done to allow some benefit from caching when all target processes read in the second matrix at the same time. The last columns needed by each process would already have been read in by other processes. The second access pattern has all target processes begin reading at column 0. This allows a collective operation to be used to eliminate redundant reads. Each column is read once and transmitted to all other target processes, though the columns are re-read when the target processes move on to the next set of rows from the first matrix.
The graphs presented earlier in Figure 5 show the execution time and number of disk reads for each cache technique using the original access pattern and no collective I/O. The graphs in Figure 6 show the performance with the second access pattern, though still without any collective I/O.
As expected, cache performance was significantly worse for this access pattern since all processes needed the same data at the same time.
Next, Figure 7 presents results using the simple Global Barrier Collective I/O algorithm.
This technique does not reduce the number of I/O operations and performs much like the non-collective experiment. When the collective I/O algorithm was changed to the Two-Phase I/O strategy, however, significant improvements were seen, as shown in Figure 8.
By eliminating the massive amount of redundancy when reading columns of the second matrix, Two-Phase I/O drastically reduces both the execution time and the number of disk reads. It also appears that reducing the number of I/O operations also directly benefits the performance of caching since even the simple caching strategies performed well for all but the largest matrix size. Also, while all caching techniques required the same number of disk reads for the smaller matrices (compulsory misses), the Centrally Coordinated cache had a higher execution time than the other strategies. This points out the tradeoff off inherent in Centrally Coordinated caching. While the global hit rate remained low, the local hit rate was higher due to the reduction in size of the cnode's locally managed cache. Few blocks had to be read from disk but many had to be retrieved from remote cnode caches, incurring additional overhead.
Figure 9 shows the performance of the NAS benchmarks as the number of ionodes in the system increases from 1 to 16. The same cache management policy (Base caching) was used in all cases.
Since the NAS benchmarks only perform write operations, there is no variation in performance for the different management policies. The one-processor problem does not benefit from the availability of more ionodes because the I/O requests performed by this benchmark are small enough to fit inside a single data block, allowing them to be serviced by a single ionode. However, when multiple processes are used to run the benchmark, more I/O requests are generated, allowing extra ionodes to relieve the congestion which results when all target processes try to access a single ionode. It can also be seen that when a given number of processors are available in the system, performance is best when the number of cnodes and ionodes is balanced. For example, with 17 processors, performance goes from worst to best when there there are: 1 cnode, 16 ionodes; 16 cnodes, 1 ionode; 4 cnodes, 13 ionodes; 9 cnodes, 8 ionodes.
A number of factors affect the performance of parallel I/O in a MPP system including architectural characteristics like the communictaion latency and number of ionodes; file system characteristics like caching policies, data distributions etc; implementation of collective I/O operations and device characteristics to name a few. Detailed simulation models of such programs can be computationally very expensive. We have developed a modular, extensible, parallel simulator to evaluate the performance of parallel I/O programs as a function of the preceding characteristics. The current version of the simulator can be used to simulate existing MPI-IO programs.
The paper presented the results of a number of performance studies using the simulator. The studies used applications from the NAS benchmark suite as well as synthetic benchmarks. The first set of experiments showed that for some examples, parallel execution can yield signifcant reductions in the execution time of the simulator itself. More detailed experimenrs with a larger class of benchmarks are in progress to evaluate the effectiveness of parallel execution in reducing the execution time for the simulator.
Experiments were also conducted to evaluate the impact of three different implementations of collective I/O operations: barrier, node grouping, and two-phase I/O. In the future, we plan to include disk-directed I/O in our investigations. These experiments used simple analytical models for the disk system that did not account sufficiently for the effect of data caching on application performance. A detailed set of models were developed to evaluate caching strategies including the cooperative caching schemes that were originally proposed for networks of workstations. As expected, the combination of using 2-phase I/O and data caching dramatically reduced the execution time for some of our applications. For a simple matrix multiplication example, the predicted execution time decreased from about five hours to 4 minutes for a relatively small matrix with 5000 elements.
Three approaches to collective I/O are discussed in [15]: traditional caching, two-phase I/O, and disk-directed I/O. Traditional caching does no collective I/O optimizations, since I/O requests are served as they arrive. These three methods were implemented and compared using the STARFISH [14] simulator, which is based on Proteus [3], a parallel architecture simulation engine.
In [1], a hybrid methodology for evaluating the performance of parallel I/O subsystems was done. PIOS, a trace-driven I/O simulator, is used to calculate the performance of the I/O system for a subset of the problem to be evaluated, while an analytical model was used for the remainder. PIOS uses a synthetically generated workload and models the Vulcan network [2].
Panda [23] provides a high-level collective I/O library interface. It implements server-directed I/O, which is disk-directed I/O at the logical file level, rather than the physical disk level. This method provides a high-level of portability by avoiding the difficult details of utilizing specific attributes for each underlying filesystem. Unfortunately, it may not produce as much performance as a true implementation of disk-directed I/O might. Also, Panda provides its own API to access its libraries. This may not be desirable since it is non-standard, which is one of the reasons why something such as MPI-IO has been proposed.
[12] presented a number of cooperative caching techniques, but only within the framework of a network environment. [8] extended cooperative caching to include parallel machines by allowing the speed of the interconnection network to be varied. However, all their results were obtained using the Sprite network workload, which may not be very representative of typical parallel machine workloads. [8] also introduced a cooperative caching strategy which eliminates the need for coherency mechanisms by coordinating the contents of one hundred percent of the system's file cache.
All the data presented in this paper was collected on the IBM-SP2 at UCLA's Office of Academic Computing, granted to UCLA by IBM Corporation under their Shared University Research Program and managed by Paul Hoffman. Special thanks to Sundeep Prakash and Punit Bhargava for their work on MPISIM and their valuable input during the simulator's extension to include I/O. Finally, we would also like to acknowledge those people from the Vesta and MPI-IO groups who took the time to answer our numerous questions.
2See http://www.research.ibm.com/people/p/prost/sections/mpiio.html
3A newer version, v0.2, of the NAS BTIO Parallel I/O Benchmarks is available. However, there are mostly syntactic differences, so it is still functionally equivalent to v0.1.
1. Sandra Johnson Baylor, Caroline Benveniste, and Leo J. Beolhouwer. A methodology for evaluating parallel I/O performance for massively parallel processors. Proceedings of the 27th Annual Simulation Symposium, pp.31-40, April 1994.
2. Sandra Johnson Baylor, Caroline Benveniste, and Yarsun Hsu. Performance evaluation of a massively parallel I/O subsystem. Ravi Jain, John Werth, and James C. Browne, editors, Input/Output in Parallel and Distributed Computer Systems, v.362 of The Kluwer International Series in Engineering and Computer Science, ch.13, pp.293-311. Kluwer Academic Publishers, 1996.
3. E.A. Brewer, C.N. Dellarocas, and W.E. Weihl. PROTEUS: A high-performance parallel-architecture simulator. Technical Report MIT/LCS/TR-516, Massachusetts Institute of Technology, Cmabridge, MA 02139, 1991.
4. D. Bailey, T. Harris, W. Saphir, R.v.d. Wijngaart, A. Woo, and M. Yarrow. The nas parallel benchmarks 2.0. Technical report nas-95-020, NASA Ames Research Center, Moffet Field, CA 94035-1000, December 1995.
5. R.G. Covington, S. Dwarkadas, J.R. Jump, J.B. Sinclair, and S. Madala. The efficient simulation of parallel computer systems. International Journal in Computer Simulation, 1:31-58, 1991.
6. Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, August 1996.
7. Peter F. Corbett, Dror G. Feitelson, Jean-Pierre Prost, George S. Almasi, Sandra Johnson Baylor, Anthony S. Bolmarcich, Yarsun Hsu, Julian Satran, Marc Snir, Robert Colao, Brian Herr, Joseph Kavaky, Thomas R. Morgan, and Anthony Zlotek. Parallel file systems for the IBM SP computers. IBM Systems Journal, 34(2):222-248, January 1995.
8. Toni Cortes, Sergi Girona, and Jesus Labarta. Avoiding the cache-coherence problem in a parallel/distributed file system. Proceedings of the High-Performance Computing and Networking, pp.860-869, April 1997.
9. Juan Miguel del Rosario, Rajesh Bordawekar, and Alok Choudhary. Improved parallel I/O via a two-phase run-time access strategy. Proceedings of the IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pp.56-70, Newport Beach, CA, 1993. Also published in Computer Architecture News 21(5), December 1993, pp.31-38.
10. H. Davis, S.R. Goldschmidt, and Hennessey. Multiprocessor simulation and tracing using Tango. Proceedings of the 1991 International Conference on Parallel Processing, pp.II99-II107, August 1991.
11. P. Dickens, P. Heidelberger, and D. Nicol. A distributed memory lapse: parallel simulation of message-passing programs. Workshop on Parallel and Distributed Simulation, pp.32-38, July 1994.
12. M.D. Dahlin, R.Y. Wang, T.E. Anderson, and D.A. Patterson. Cooperative caching: using remote client memory to improve file system performance. Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, pp.61-74, November 1994.
13. MPI Forum, MPI: a message passing interface. Proceedings of the 1993 Supercomputing Conference, Portland, Washington, November 1993.
14. David Kotz. Tuning STARFISH. Technical Report PCS-TR96-296, Dept. of Computer Science, Dartmouth College, October 1996.
15. David Kotz. Disk-directed I/O for MIMD mulitprocessors. ACM Transactions on Computer Systems, 15(1):41-74, February 1997.
16. U. Legedza and W.E. Weihl. Reducing synchronization overhead in parallel simulation. 10th Workshop on Parallel and Distributed Simulation, pp.86-95, May 1996.
17. MPI-IO Committee. MPI-IO: a parallel file I/O interface for MPI. Version 0.5, April 1996. See http://lovelace.nas.nasa.gov/MPI-IO/mpi-io-report.0.5.ps.
18. Bill Nitzberg. Performance of the iPSC/806 Concurrent File System. Technical Report RND-92-020, NAS Systems Division, NASA Ames, December 1992.
19. Nils Nieuwejaar and David Kotz. The Galley parallel file system. Proceedings of the 10th ACM International Conference on Supercomputing, pp.374-381, Philadelphia, PA, May 1996.
20. Samuel A. Fineberg, Parkson Wong, Bill Nitzberg, and Chris Kuszaul. PMPIO-a portable implementation of MPI-IO. Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, pp.188-195, October 1996.
21. Sundeep Prakash. Performance prediction of parallel programs. Ph.D. dissertation, Dept. of Computer Science, UCLA, Los Angeles, CA, November 1996.
22. S.K Reinhardt, M.D. Hill, J.R. Larus, A.R. Lebeck, J.C. Lewis, and D.A. Wood. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers. Proceedings of the 1993 ACM SIGMETRICS Conference, May 1993.
23. K.E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. Proceedings of Supercomputing '95,San Diego, CA, December 1995.
24. Darren Sanders, Yoonho Park, and Maciej Brodowicz. Implementation and performance of MPI-IO file access using MPI datatypes. Technical Report UH-CS-96-12, University of Houston, November 1996.