Performance Analysis and Optimization of the CUDA Implementation of the Three-Dimensional Subsurface XCA-Flow Cellular Automaton
De Rango A.;Furnari L.;Senatore A.;Mendicino G.;D'Ambrosio D.
2023-01-01
Abstract
We present a performance assessment and optimisation of the CUDA implementation of the three-dimensional XCA-Flow subsurface Extended Cellular Automata model. As the benchmark, we used a ten-day simulation already adopted in previous works, characterized by a constant infiltration rate and a heterogeneous hydraulic conductivity field. We ran the experiments on an Nvidia V100 high-performance many-core device. We analysed essential aspects of the XCA-Flow model by updating its kernels. As a first step, we applied classical tiling/shared memory techniques to the stencil-based and reduction kernels. The results suggested carrying out a thorough analysis of the model. Both theoretical and experimental assessments drove this analysis, which pointed out the need to increase the achieved warp occupancy to speed up the computation. The resulting general redesign of the application yielded a 20.3% mean performance gain over the CUDA block configurations considered. We also performed two Roofline analyses to characterise the kernels of the original and improved implementations in terms of arithmetic intensity and performance. Besides the improved performance, we obtained meaningful insights into the CUDA implementation of the XCA-Flow model that could, in principle, allow for further optimisations.
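
As a reading aid for the Roofline analyses mentioned in the abstract, the following minimal sketch shows how the standard Roofline bound relates a kernel's arithmetic intensity to its attainable performance. It is not the authors' tooling: the function names are hypothetical, and the peak and bandwidth values are assumed placeholders roughly in line with nominal double-precision figures for a V100-class device, not values taken from the paper.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Standard Roofline bound.

    arithmetic_intensity: FLOP per byte moved to/from device memory.
    peak_gflops: compute roof in GFLOP/s.
    bandwidth_gbs: memory roof slope in GB/s.
    """
    # A kernel is limited either by memory traffic (AI * bandwidth)
    # or by the compute peak, whichever is lower.
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# ASSUMPTION: illustrative roofs only, not measurements from the paper.
PEAK_GFLOPS = 7800.0    # assumed double-precision compute roof
BANDWIDTH_GBS = 900.0   # assumed device memory bandwidth

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    # Hypothetical intensities: low values are typical of stencil kernels.
    for ai in (0.25, 1.0, 16.0):
        bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        regime = "memory-bound" if ai < ridge else "compute-bound"
        print(f"AI={ai:6.2f} FLOP/B -> bound {bound:8.1f} GFLOP/s ({regime})")
```

Under these assumed roofs, kernels with an arithmetic intensity below the ridge point are limited by memory bandwidth, which is the regime where techniques such as tiling and higher occupancy are usually expected to matter most.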