Generative Adversarial Networks Assist Missing Data Imputation: A Comprehensive Survey and Evaluation

Shahbazian R.; Greco S.
2023-01-01

Abstract

Missing data imputation is a technique for dealing with incomplete datasets. Because many models and algorithms cannot be applied to data containing missing values, a pre-processing step is needed either to remove incomplete records or to estimate the missing values. This well-known problem is referred to as the data imputation problem. Several approaches have been designed for it, and they fall into two main categories: statistical algorithms and machine learning-based algorithms. Because machine learning algorithms are optimized, they usually perform better than statistical ones. In this paper, we review the most recent literature on missing data imputation based on generative adversarial networks (GANs), which have gained tremendous attention for handling missing values. We examine the structures of GANs for missing data imputation and discuss the datasets and metrics commonly used for evaluation. We also cover how the type of missing data, the fraction of missing data, and algorithm-related problems affect imputation performance. We conduct experiments on two publicly available datasets and compare the performance of GAIN, a GAN-based missing data imputation algorithm, with that of existing state-of-the-art approaches, demonstrating that the GAN-based algorithm outperforms the others in terms of root mean square error (RMSE) and Fréchet inception distance (FID).
2023
Keywords: Generative adversarial networks; missing data; data imputation; semi-supervised learning; data cleaning
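
The abstract refers to the missing data fraction and to RMSE as an evaluation metric. As an illustration only, the following minimal sketch shows how such an evaluation is commonly set up: hide a chosen fraction of entries, impute them, and measure RMSE on the hidden entries. It is not the paper's experimental protocol and does not implement GAIN. The synthetic Gaussian data, the assumption that values are missing completely at random, the column-mean imputer (a simple statistical baseline), and all function names are assumptions made for this example.

```python
import numpy as np


def mask_completely_at_random(data, missing_fraction, rng):
    """Hide roughly `missing_fraction` of the entries (assumed MCAR mechanism)."""
    mask = rng.random(data.shape) < missing_fraction  # True where a value is hidden
    corrupted = data.copy()
    corrupted[mask] = np.nan
    return corrupted, mask


def mean_impute(corrupted):
    """Fill each missing entry with the mean of the observed values in its column."""
    column_means = np.nanmean(corrupted, axis=0)
    return np.where(np.isnan(corrupted), column_means, corrupted)


def rmse_on_missing(truth, imputed, mask):
    """Root mean square error computed only over the entries that were hidden."""
    diff = truth[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in for a complete dataset (hypothetical, not from the paper).
rng = np.random.default_rng(0)
truth = rng.normal(size=(500, 4))

for fraction in (0.1, 0.3, 0.5):
    corrupted, mask = mask_completely_at_random(truth, fraction, rng)
    imputed = mean_impute(corrupted)
    print(f"missing fraction {fraction:.1f}: RMSE = {rmse_on_missing(truth, imputed, mask):.3f}")
```

Any other imputer, including a GAN-based one, could be swapped in for `mean_impute` and scored with the same `rmse_on_missing` function, provided it returns a fully filled array of the same shape.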
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/361245