Reasoning on the gap between synthetic and authentic diversity: The limits of computational solutions to representation bias

Ruga, Tommaso; Zumpano, Ester; Vocaturo, Eugenio; Caroprese, Luciano

2026-01-01

Abstract

The severe underrepresentation of darker skin tones in dermatological training datasets perpetuates critical healthcare disparities in melanoma detection. Dermatological AI tools trained on predominantly light-skinned datasets degrade sharply on darker skin tones, with diagnostic accuracy for melanoma falling from 92% to 56%. The Pipsqueak dataset, presented in Ruga et al. (2025), showed that fewer than 20 diagnostic-quality melanoma images of Fitzpatrick skin types V-VI exist across publicly available datasets. The ideal solution is to collect real data, but this would take years, and in the meantime algorithms trained on homogeneous data continue to be deployed clinically. This paper introduces two datasets: HAM-SyntheticDarker, a synthetic dataset generated through controlled color-luminosity matching, and HAM-HybridEquity, obtained by combining a real and a synthetic dataset to promote equity. The study extends Ruga et al. (2025) and conducts a series of experiments, using the MultiExCam framework (Ruga et al., 2026), to document the benefits and limitations of synthetic diversity and how far it can overcome bias and promote fairness. The results show that synthetic diversity cannot substitute for authentic diversity. They also show that models trained on darker skin generalize better to lighter skin than the converse, which reveals directional representation biases and provides empirical evidence that synthetic diversity is supplementary rather than substitutive, offering only modest interim improvements. The ethical imperative therefore remains: to develop dermatological imaging datasets that represent the full spectrum of human skin diversity.
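The accuracy drop quoted in the abstract can be expressed as a per-group comparison. The following is a minimal sketch, not the authors' evaluation code and not part of the MultiExCam framework: the function names, group tags, and toy labels are hypothetical, and the only figures taken from the abstract are the 92% and 56% accuracies used in the last lines.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions, and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def accuracy_gap(acc_a, acc_b):
    """Return (gap in percentage points, drop relative to acc_a); inputs lie in (0, 1]."""
    absolute_pp = (acc_a - acc_b) * 100
    relative = (acc_a - acc_b) / acc_a
    return absolute_pp, relative


# Toy data (hypothetical): 1 = melanoma, 0 = benign; group tags are Fitzpatrick ranges.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
groups = ["I-II", "I-II", "I-II", "I-II", "V-VI", "V-VI", "V-VI", "V-VI"]
per_group = accuracy_by_group(y_true, y_pred, groups)

# The drop quoted in the abstract: from 92% to 56%.
gap_pp, gap_rel = accuracy_gap(0.92, 0.56)
```

With the abstract's figures, the gap is 36 percentage points, a relative drop of roughly 39%. Reporting accuracy per group rather than in aggregate is what makes a disparity of this kind visible.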

Year: 2026

Keywords: Dataset bias; Melanoma classification; Skin tone diversity; Synthetic data conversion; Transfer learning

Appears in types: 1.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/405164
