
Enhancing active learning through latent space exploration: A k-nearest neighbors approach

Flesca S.; Mandaglio D.; Scala F.
2025-01-01

Abstract

Supervised machine learning often requires a large volume of labeled training data, incurring substantial annotation costs. In scenarios with limited labeling budgets, selecting the most informative instances for labeling by an annotation oracle becomes crucial. Active learning addresses this challenge by strategically choosing informative instances for labeling, thereby maximizing model performance with limited labeled data. Existing active learning methods, however, typically do not fully exploit the abundant unlabeled data from which meaningful features can be extracted. While some methods integrate variational autoencoders (VAEs) into active learning, this work introduces a novel framework that goes beyond using VAEs merely to assist in selecting data for the oracle. Instead, our approach leverages the latent space learned by the VAE to heuristically annotate unlabeled data through a k-nearest neighbors classifier within this space. The proposed approach makes it possible to enhance existing active learning methods without relying solely on an annotation oracle, thus reducing the overall annotation cost. Experiments on benchmark datasets show that our proposal can improve the performance of existing active learning methods by up to 33% in classification accuracy and by up to 0.38 in F1-score when the initial labeled data is extremely limited. We make the source code and evaluation data available at https://github.com/Franco7Scala/Laken.
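The core idea described in the abstract — heuristically annotating unlabeled instances with a k-nearest neighbors classifier fit in the VAE latent space, then augmenting the labeled pool — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the latent codes are simulated with random vectors standing in for VAE encoder outputs, and the 0.8 confidence threshold is a hypothetical choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins for VAE-encoded latent vectors (in the paper these would be
# the encoder outputs of a trained VAE; here we simulate them).
z_labeled = rng.normal(size=(20, 8))      # latent codes of the labeled pool
y_labeled = rng.integers(0, 2, size=20)   # labels provided by the oracle
z_unlabeled = rng.normal(size=(100, 8))   # latent codes of the unlabeled pool

# Fit a k-NN classifier in the latent space on the oracle-labeled data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(z_labeled, y_labeled)

# Heuristic annotation: keep only pseudo-labels whose neighborhood
# agreement exceeds a confidence threshold (hypothetical value 0.8,
# i.e. at least 4 of 5 neighbors agree).
proba = knn.predict_proba(z_unlabeled)
confidence = proba.max(axis=1)
pseudo_labels = knn.classes_[proba.argmax(axis=1)]
mask = confidence >= 0.8

# Augment the labeled set with the confidently pseudo-labeled instances;
# the downstream active learner then trains on this larger pool.
augmented_z = np.vstack([z_labeled, z_unlabeled[mask]])
augmented_y = np.concatenate([y_labeled, pseudo_labels[mask]])
```

Thresholding on neighborhood agreement keeps the pseudo-labels conservative: instances whose latent neighbors disagree are left for the oracle, which is what lets the method reduce annotation cost without relying on it exclusively.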
Active learning
Latent space
Pseudo-labeling
Files in this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11770/399962
Warning! The data shown have not been validated by the university.

Citations
  • Scopus: 0