Scalable Compression of Massive Data Collections on HPC Systems

Belcastro, Loris; Ferragina, Paolo; Manzini, Giovanni; Marozzo, Fabrizio; Talia, Domenico; Trunfio, Paolo

doi:10.1007/978-3-031-99857-7_23

The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient data compression techniques are vital for minimizing storage overhead, communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the challenges posed by large-scale data collections in terms of compression efficiency and scalability. Experiments were conducted on the Leonardo petascale supercomputer of CINECA (leonardo-supercomputer.cineca.eu), and processed a subset of the Software Heritage archive, consisting of about 49 million files of C++ code, totaling 1.1 TB of space. Experimental results show significant performance in both compression speedup and scalability.

Scalable Compression of Massive Data Collections on HPC Systems

Belcastro, Loris;Ferragina, Paolo;Manzini, Giovanni;Marozzo, Fabrizio;Talia, Domenico;Trunfio, Paolo

2026-01-01

Abstract

The exponential growth of digital data poses a significant storage challenge, straining current storage systems in terms of cost, efficiency, maintainability, and available resources. For large-scale data archiving, highly efficient data compression techniques are vital for minimizing storage overhead, communication efficiency, and optimizing data retrieval performance. This paper presents a scalable parallel workflow designed to compress vast collections of files on high-performance computing systems. Leveraging the Permute-Partition-Compress (PPC) paradigm, the proposed workflow optimizes both compression ratio and processing speed. By integrating a data clustering technique, our solution effectively addresses the challenges posed by large-scale data collections in terms of compression efficiency and scalability. Experiments were conducted on the Leonardo petascale supercomputer of CINECA (leonardo-supercomputer.cineca.eu), and processed a subset of the Software Heritage archive, consisting of about 49 million files of C++ code, totaling 1.1 TB of space. Experimental results show significant performance in both compression speedup and scalability.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2026
			
	Codice ISBN
	
				9783031998560
9783031998577
			
	Parole chiave
	
				Big Data
Data Compression
Distributed Processing
HPC
Parallel Computing
			
	Appare nelle tipologie:
	
				4.1 Contributo in Atti di convegno

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.11770/399598

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

0

0

Scalable Compression of Massive Data Collections on HPC Systems

Belcastro, Loris;Ferragina, Paolo;Manzini, Giovanni;Marozzo, Fabrizio;Talia, Domenico;Trunfio, Paolo

2026-01-01

Abstract

Scheda breve Scheda completa Scheda completa (DC)

Informazioni

Attenzione

Citazioni

social impact

Conferma cancellazione

Scheda breve

Scheda completa

Scheda completa (DC)