The definition variable implementation was compared with a simple barrier implementation on the synthetic benchmark shown in Figure 2.13. The translated SPMD program was run on an IBM SP2 using MPI [MPI93] for communication. Both implementations of the benchmark were also simulated using Maisie [Bag94]. To isolate the benefits that the definition variable method derives from eliminating barrier synchronization from those it derives from eliminating barrier overhead, message passing time was set to zero in the simulations.
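The contrast between the two schemes can be sketched with a minimal shared-memory analogue in Python threads. This is not the MPI implementation used in the experiments; the class `DefVar` and its methods are illustrative names for the idea of a single-assignment variable on which consumers block, instead of all processors waiting at a global barrier:

```python
import threading

class DefVar:
    """Hypothetical single-assignment 'definition variable': a reader
    blocks until the variable is defined, rather than at a barrier."""
    def __init__(self):
        self._defined = threading.Event()
        self._value = None

    def define(self, value):
        self._value = value
        self._defined.set()

    def use(self):
        self._defined.wait()   # wait only for THIS definition, not for everyone
        return self._value

P = 4                          # number of simulated processors
a = DefVar()
results = [None] * P

def worker(rank):
    if rank == 0:
        a.define(100)          # only rank 0 produces the value of a
    results[rank] = a.use()    # consumers block only until a is defined

threads = [threading.Thread(target=worker, args=(r,)) for r in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                 # every rank observed the single definition
```

A barrier version of `worker` would instead call `barrier.wait()` after the definition, forcing every rank to synchronize with every other rank each iteration even when it needs only one value.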
Figure 2.13: Synthetic Benchmark
The parameters of the benchmark code are:
Results from the simulations are shown in Figure 2.14 and those from the IBM SP2 in Figure 2.15. In all the graphs, the Y-axis shows the percentage improvement of the definition variable method over the barrier method. Improvements were fairly independent of D from D=100 onwards, hence all graphs use the same value of D. On the whole, it was observed that improvements depend on a complex range of factors, but grow primarily with two: (a) the sparsity of the communication pattern and (b) the effective gap between the definition and the use of a. The sparsity of the communication pattern is inversely proportional to the average percentage of processors from which a processor receives requests in each iteration; hence pattern 1 is the sparsest and pattern 5 the densest. The effective gap is the mean number of statements between the definition and the use of a, which is equal to GAP·N/P. Given a particular problem size, the effective gap therefore decreases as the number of processors P increases. Hence only a specific value of GAP is shown, since the variation of improvement with effective gap is already captured.
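A small worked example of how the effective gap shrinks with P, under the assumption (consistent with the discussion above) that the effective gap equals GAP·N/P, i.e. each processor owns N/P elements with GAP statements between the definition and use of each:

```python
# Assumed relation: effective_gap = GAP * N / P.
# GAP, N, and P are the benchmark parameters; the exact formula is an
# assumption inferred from the surrounding discussion, not taken verbatim.
def effective_gap(GAP, N, P):
    return GAP * N // P

GAP = 100
for N in (32, 256):
    print(N, [effective_gap(GAP, N, P) for P in (2, 4, 8, 16, 32)])
```

For a fixed problem size N, doubling P halves the effective gap, which is why the improvement eventually stabilizes or falls off as more processors are added.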
Figure 2.14: Simulation Results
Figure 2.15: SP2 results
In the simulations of pattern 1, the improvement increases as the number of processors increases. After a point, however, the effective gap becomes small for each problem size, and the improvements stabilize or decrease. The maximum improvement point is reached earlier for smaller problem sizes: N=32 peaks at 16 processors, while the larger problem sizes continue to improve. For the same pattern on the SP2, the smallest problem size N=16 shows significantly better improvement than N=1024 for all P. This is due to the elimination of barrier overhead, which was not a factor in the simulation. Barrier overhead is constant across problem sizes and increases with the number of processors; it therefore constitutes a larger percentage of the total execution time in the barrier implementation of small problem sizes, and its elimination yields a greater improvement.

Pattern 2 is similar to pattern 1, but the improvements are smaller because it requires more synchronization than pattern 1. Pattern 3 is an unbalanced pattern, unlike all the others, which are balanced; hence the sparsity of the pattern differs across processors. These differences in sparsity account for the sudden decrease in performance as P increases. In the simulation of pattern 3, for N=32, increasing P beyond 4 consistently worsens performance. This is because processor 0 is flooded with early requests and hence slows down, which in turn results in even more early requests. The effect appears earlier for N=32 than for N=256 because the effective gap becomes small correspondingly earlier. In the runs on the SP2 the same effect is seen, but the elimination of barrier overhead prevents it from making the overall improvement negative.

The random communication pattern changes sparsity with the problem size and the number of processors, and hence acts like pattern 5 when dense and like pattern 2 when sparse.
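The sparsity ordering used in the analysis above can be made concrete. Taking the density of a pattern to be the average fraction of the P processors from which each processor receives requests (sparsity being inversely proportional to this value), a short calculation distinguishes balanced sparse, balanced dense, and unbalanced patterns. The patterns below are illustrative, not the actual benchmark patterns:

```python
def density(pattern, P):
    """Average fraction of the P processors that send requests to each
    processor; sparsity is inversely proportional to this value."""
    return sum(len(senders) for senders in pattern) / (len(pattern) * P)

P = 4
# pattern[i] = set of ranks sending requests to rank i (illustrative).
nearest_neighbour = [{(i - 1) % P} for i in range(P)]   # balanced, sparse
all_to_all = [set(range(P)) - {i} for i in range(P)]    # balanced, dense
unbalanced = [set(range(P)) - {0}, set(), set(), set()] # rank 0 is flooded,
                                                        # as in pattern 3

for name, p in [("nearest", nearest_neighbour),
                ("all-to-all", all_to_all),
                ("unbalanced", unbalanced)]:
    print(name, density(p, P))
```

Note that the unbalanced pattern has a low average density, yet concentrates all requests on one rank; average sparsity alone does not capture its behaviour, which is why pattern 3 degrades even though it is not dense overall.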
On the whole, the definition variable method works best for balanced, sparse communication patterns with large effective gaps.