Next: Conclusion Up: Experiments and Results Previous: Effect of Collective

Effect of Caching Policies

Figures 3 and 4 show the performance of each cache management policy while running the synthetic workload. The workload consisted of 5000 warm-up requests followed by 5000 measured requests, with each request generally consisting of between one and four data blocks. Eighty percent of the requests were reads. Figure 3 shows the execution time of the benchmark as the number of ionodes in the system is increased from 1 to 16 (the number of cnodes is held constant at either 4 or 8). Figure 4 shows the number of data blocks that actually had to be read from disk for the same workload and file system configurations. Reads were chosen because this is where the major difference in performance among the cache management policies can be seen. To eliminate variation due to differences in the handling of write requests, the experiments were run using write-through and write-around, for which all cache policies behave the same.

For each set of graphs, two scenarios were examined. In the first, each ionode contained a 128 block cache, so the aggregate size of the ionode cache increased as more ionodes were added. In the second, the total I/O cache size was held constant at 512 blocks as more ionodes were added, in order to separate the benefit of having more ionode cache from the benefit of having more disks available to perform I/O. Thus, with 4 ionodes each has a 128 block cache, and with 16 ionodes each has a 32 block cache. These results show that caching performance continued to improve as the level of cache cooperation was increased and refined.
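The two cache-sizing scenarios and the shape of the synthetic workload can be illustrated with a short Python sketch. This is not the code used in the experiments; the function names, the scenario labels, and the uniform choice of request size are assumptions made for illustration only. The text says only that requests "generally" contained one to four blocks.

```python
import random

def per_ionode_cache_blocks(num_ionodes, scenario, per_node=128, total=512):
    """Cache size (in blocks) of each ionode under the two scenarios.

    "fixed_per_ionode": every ionode has a 128 block cache, so the
    aggregate cache grows with the number of ionodes.
    "fixed_total": the aggregate cache stays at 512 blocks and is
    divided evenly among the ionodes.
    """
    if scenario == "fixed_per_ionode":
        return per_node
    if scenario == "fixed_total":
        return total // num_ionodes
    raise ValueError(f"unknown scenario: {scenario}")

def make_requests(count, read_fraction=0.8, max_blocks=4, seed=0):
    """Generate (operation, size_in_blocks) pairs.

    Assumption: sizes are drawn uniformly from 1..max_blocks; the
    original distribution is not specified.
    """
    rng = random.Random(seed)
    return [
        ("read" if rng.random() < read_fraction else "write",
         rng.randint(1, max_blocks))
        for _ in range(count)
    ]

warmup = make_requests(5000, seed=1)     # not measured
measured = make_requests(5000, seed=2)   # measured portion of the run
```

With these definitions, `per_ionode_cache_blocks(4, "fixed_total")` gives 128 and `per_ionode_cache_blocks(16, "fixed_total")` gives 32, matching the figures quoted above.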

Figure 5 shows the performance of the NAS benchmarks as the number of ionodes in the system increases from 1 to 16. The same cache management policy (Base caching) was used in all cases. Since the NAS benchmarks perform only write operations, there is no variation in performance among the different management policies. The one-processor problem does not benefit from the availability of more ionodes because the I/O requests performed by this benchmark are small enough to fit inside a single data block, allowing them to be serviced by a single ionode. However, when multiple processes are used to run the benchmark, more I/O requests are generated, allowing extra ionodes to relieve the congestion that results when all target processes try to access a single ionode. It can also be seen that, for a given number of processors in the system, performance is best when the number of cnodes and ionodes is balanced. For example, with 17 processors, the configurations rank from worst to best as follows: 1 cnode, 16 ionodes; 16 cnodes, 1 ionode; 4 cnodes, 13 ionodes; 9 cnodes, 8 ionodes.

Figure 6 shows the execution time and the number of disk reads for the matrix multiplication program, as the size of the matrices is increased, with each of the cache management policies. The same file system configuration, consisting of four cnodes and one ionode, was used in all cases. The sizes of the matrices were chosen so that the dataset gradually becomes larger than the combined capacity of the ionode cache and a single cnode cache. It was expected that this would show the fundamental difference in the way each caching scheme operates. When the dataset fits comfortably into these two caches, the base caching scheme performs nearly as well as the cooperative caching algorithms. But as the dataset becomes larger, both base caching and greedy forwarding become ineffective. This is due to the repeating, sequential access pattern of the matrix multiplication program. By the time the last columns of the second matrix are read, the first columns, which must be re-read for multiplication with the next row, have been evicted from the cache. Centrally coordinated caching performs extremely well. By eliminating data redundancy for a large portion of the file system's cache, all necessary data remains in the cache, resulting in compulsory misses only. Of course, once the data becomes too large to remain in the ionode cache and the centrally coordinated cache, performance would become as poor as that of the other caching schemes.
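The effect of a repeating, sequential access pattern on a cache that is slightly too small can be reproduced with a minimal sketch. It assumes a single cache with least-recently-used (LRU) replacement, which is an assumption for illustration; the replacement policy of the actual file system is not stated here, and the function name is hypothetical.

```python
from collections import OrderedDict

def lru_misses(cache_blocks, dataset_blocks, passes):
    """Count misses when blocks 0..dataset_blocks-1 are scanned in
    order, `passes` times, through an LRU cache of `cache_blocks`."""
    cache = OrderedDict()   # keys ordered from least to most recently used
    misses = 0
    for _ in range(passes):
        for block in range(dataset_blocks):
            if block in cache:
                cache.move_to_end(block)       # hit: mark as most recent
            else:
                misses += 1
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return misses

fits = lru_misses(cache_blocks=8, dataset_blocks=8, passes=5)
too_big = lru_misses(cache_blocks=8, dataset_blocks=9, passes=5)
```

When the dataset fits, only the 8 compulsory misses of the first pass occur. With a dataset just one block larger than the cache, every one of the 45 accesses misses, because each block is evicted shortly before it is needed again.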

The poor performance of the globally managed caching scheme was quite unexpected, especially since it clearly outperformed all the other caching policies for the synthetic benchmark. The simulator was used to closely examine the program's actual data access pattern and the resulting cache behavior. It turned out that the program's heavy inter-process data sharing, coupled with the sequentiality imposed by using a single ionode, interacts poorly with the non-redundancy policy of globally managed caching. Data tended to clump together in the cache of the last cnode to access the shared data, while the caches of the other cnodes remained relatively empty. Once this unfortunate cnode's cache became full, it would send evicted blocks to the ionode cache. Once the ionode cache became full, evicted blocks were discarded, even though many cnodes had space available within their caches.
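The clumping behavior can be sketched with a toy model. This is not the simulator used in the experiments, and every detail is an assumption made for illustration: a block lives in at most one cache, it migrates to whichever cnode accessed it most recently, each cache evicts its least recently used block, and all cnodes read the same shared blocks in turn. The function name and default sizes are hypothetical.

```python
from collections import OrderedDict

def simulate_clumping(num_cnodes=4, cnode_capacity=4,
                      ionode_capacity=4, shared_blocks=12):
    """Toy model of a non-redundant (single-copy) cooperative cache."""
    cnodes = [OrderedDict() for _ in range(num_cnodes)]
    ionode = OrderedDict()
    disk_reads = 0
    discarded = 0
    for block in range(shared_blocks):
        for k in range(num_cnodes):          # every cnode reads the block
            if block in cnodes[k]:
                cnodes[k].move_to_end(block)
                continue
            holder = next((c for c in cnodes if block in c), None)
            if holder is not None:
                del holder[block]            # single copy: move, not duplicate
            elif block in ionode:
                del ionode[block]
            else:
                disk_reads += 1              # not cached anywhere
            cnodes[k][block] = True
            if len(cnodes[k]) > cnode_capacity:
                victim, _ = cnodes[k].popitem(last=False)
                ionode[victim] = True        # cnode evicts to the ionode cache
                if len(ionode) > ionode_capacity:
                    ionode.popitem(last=False)
                    discarded += 1           # ionode cache full: block is lost
    return [len(c) for c in cnodes], len(ionode), disk_reads, discarded

occupancy, ionode_used, disk_reads, discarded = simulate_clumping()
```

Under these assumptions the run ends with occupancy `[0, 0, 0, 4]`: the last cnode's cache is full, the other three are empty, and 4 of the 12 shared blocks have been discarded even though the caches together could hold 20 blocks.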






Andy Kahn
Tue Jun 24 17:48:10 PDT 1997