
Partitioned Reduction for Heterogeneous Environments

De Rango A.; D'ambrosio D.; Mendicino G.
2024-01-01

Abstract

Performance in HPC applications today largely depends on the efficiency of MPI, the de facto standard message-passing library for exploiting parallelism. Features such as multithreading and the overlap of communication with computation are continuously studied to adapt to new platforms with ever larger numbers of processing units, such as GPU-based systems. In this context, the recent MPI-4.0 standard introduced partitioned point-to-point communication primitives to improve computation-communication overlap. This paper introduces an extension to MPI that brings partitioned communication to MPI reduction primitives. Traditional reduction operations process the complete input vector only after the GPU computation has finished. In contrast, the proposed methodology exploits message partitioning to perform the reduction incrementally: individual partitions of the input vector are processed as soon as they become available, removing the need to wait for the full GPU computation before starting the reduction. The results demonstrate promising benefits, particularly for large message sizes, although synchronization points remain potential bottlenecks and require careful analysis.
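As a rough illustration of the idea summarised in the abstract, the following C sketch uses only the standard MPI-4.0 partitioned point-to-point primitives (MPI_Psend_init, MPI_Precv_init, MPI_Pready, MPI_Parrived) to fold partitions of a vector into a running sum as soon as they arrive, instead of waiting for the whole buffer. It is not the paper's partitioned-reduction API, which is not reproduced in this record; the partition count, partition size, two-rank layout and plain summation are illustrative assumptions.

/*
 * Minimal sketch (not the paper's API): incremental reduction over a
 * partitioned transfer using standard MPI-4.0 primitives only.
 * The sender marks each partition ready as soon as its chunk is computed
 * (e.g., when a GPU stream finishes producing it); the receiver polls
 * MPI_Parrived and reduces arrived partitions into a running sum.
 * Run with at least two ranks, e.g. mpirun -np 2.
 */
#include <mpi.h>
#include <stdlib.h>

#define PARTITIONS 8            /* number of partitions (assumed) */
#define PART_COUNT 1024         /* elements per partition (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(PARTITIONS * PART_COUNT * sizeof(double));
    MPI_Request req;

    if (rank == 0) {                        /* producer side */
        MPI_Psend_init(buf, PARTITIONS, PART_COUNT, MPI_DOUBLE,
                       1, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < PARTITIONS; ++p) {
            /* ... compute partition p (e.g., copy back a finished GPU chunk) ... */
            for (int i = 0; i < PART_COUNT; ++i)
                buf[p * PART_COUNT + i] = 1.0;
            MPI_Pready(p, req);             /* expose partition p for transfer */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) {                 /* consumer / reducer side */
        double sum = 0.0;
        int done[PARTITIONS] = {0}, remaining = PARTITIONS;
        MPI_Precv_init(buf, PARTITIONS, PART_COUNT, MPI_DOUBLE,
                       0, 0, MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        while (remaining > 0) {
            for (int p = 0; p < PARTITIONS; ++p) {
                int flag = 0;
                if (!done[p]) MPI_Parrived(req, p, &flag);
                if (flag) {                 /* reduce partition p as soon as it lands */
                    for (int i = 0; i < PART_COUNT; ++i)
                        sum += buf[p * PART_COUNT + i];
                    done[p] = 1;
                    --remaining;
                }
            }
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}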
2024
distributed computing
GPU programming
MPI
partitioned communication
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/376402

Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science (ISI): 0