Active Learning with Task-Driven Representations for Messy Pools

Kianoosh Ashouritaklimi, Tom Rainforth

TL;DR

This work tackles active learning (AL) in messy pools, where fixed unsupervised representations can miss task-relevant information. It introduces task-driven representations that are periodically updated during AL, with two concrete strategies: a Split Representation Approach inspired by CCVAE, and a Representation Fine-Tuning Approach that adapts a pretrained encoder using the acquired labels. Empirically, the proposed methods (TD-SPLIT and TD-FT) significantly improve acquisition quality and final accuracy on F+MNIST, CIFAR-10+100, and CheXpert, outperforming unsupervised and transfer-learning baselines, while EPIG remains a strong, complementary acquisition strategy. The results highlight the importance of aligning representation learning with the downstream task in AL and show practical benefits for real-world messy data. The underlying probabilistic modeling framework supports uncertainty estimation during sequential labeling.

Abstract

Active learning has the potential to be especially useful for messy, uncurated pools where datapoints vary in relevance to the target task. However, state-of-the-art approaches to this problem currently rely on using fixed, unsupervised representations of the pool, focusing on modifying the acquisition function instead. We show that this model setup can undermine their effectiveness at dealing with messy pools, as such representations can fail to capture important information relevant to the task. To address this, we propose using task-driven representations that are periodically updated during the active learning process using the previously collected labels. We introduce two specific strategies for learning these representations, one based on directly learning semi-supervised representations and the other based on supervised fine-tuning of an initial unsupervised representation. We find that both significantly improve empirical performance over using unsupervised or pretrained representations.
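
The loop described above, in which the representation is periodically refit on the labels collected so far, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `fit_representation` is a hypothetical toy linear projection standing in for the semi-supervised or fine-tuned encoder, the classifier is a nearest-centroid softmax, and plain predictive entropy is swapped in for the acquisition function.

```python
import numpy as np

def fit_representation(X_lab, y_lab, dim):
    """Hypothetical toy 'task-driven' encoder: a linear map spanning the
    directions along which the labelled class means differ."""
    classes = np.unique(y_lab)
    means = np.stack([X_lab[y_lab == c].mean(axis=0) for c in classes])
    centred = means - means.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:dim].T  # shape (n_features, k) with k <= dim

def predict_proba(Z_lab, y_lab, Z, n_classes):
    """Softmax over negative squared distances to labelled class centroids."""
    logits = np.full((Z.shape[0], n_classes), -1e9)  # unseen classes get ~0 mass
    for c in np.unique(y_lab):
        centroid = Z_lab[y_lab == c].mean(axis=0)
        logits[:, c] = -((Z - centroid) ** 2).sum(axis=1)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def active_learning_loop(X, oracle, n_classes, init_idx, budget, update_every, dim=2):
    labelled = list(init_idx)
    y_lab = [oracle(i) for i in labelled]
    W = fit_representation(X[labelled], np.array(y_lab), dim)
    n_updates = 1
    for step in range(budget):
        # Periodically refit the representation using all labels acquired so far.
        if step > 0 and step % update_every == 0:
            W = fit_representation(X[labelled], np.array(y_lab), dim)
            n_updates += 1
        Z = X @ W
        probs = predict_proba(Z[labelled], np.array(y_lab), Z, n_classes)
        # Assumption: predictive entropy as a simple stand-in acquisition score.
        scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        scores[labelled] = -np.inf  # never re-acquire a labelled point
        pick = int(np.argmax(scores))
        labelled.append(pick)
        y_lab.append(oracle(pick))
    return labelled, W, n_updates

# Synthetic pool: only feature 0 is relevant to the (binary) task.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
labelled, W, n_updates = active_learning_loop(
    X, lambda i: int(y[i]), n_classes=2, init_idx=[0, 1, 2, 3],
    budget=20, update_every=5,
)
```

The `update_every` parameter controls how often the representation is refit; with a real encoder this would trade compute against how quickly new labels influence acquisition.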

Paper Structure

This paper contains 59 sections, 3 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Test accuracy for EPIG with unsupervised representations on F+MNIST (left) and CIFAR-10+100 (right) (see the experiments section) under increasing levels of pool "messiness", i.e. a decreasing number of pool samples that belong to the classes of interest. All experiments were run for 4 seeds; the solid line shows the mean and the shading $\pm 1$ standard error.
  • Figure 2: Test accuracy on CIFAR-10+100 using the US approach, our task-driven TD-FT approach, and a TRANSFER learning approach. All experiments were run for 4 seeds. Solid line shows mean and shading $\pm 1$ standard error.
  • Figure 3: Test accuracy on F+MNIST using the US approach and our task-driven approach. The top row shows the results using VAE-based encoders and the bottom row shows the results for SimCLRv2 encoders. Experiments were run for 4 seeds.
  • Figure 4: Test accuracy on CheXpert using the US approach and our task-driven approach. The top row shows the results using VAE-based encoders and the bottom row shows the results for SimCLRv2 encoders. Experiments were run for 4 seeds.
  • Figure 5: Test accuracy for our TD-SPLIT approach and the baselines considered in the comparisons table on F+MNIST, CIFAR-10+100, and CheXpert. All experiments were run for 4 seeds. Solid line shows mean and shading $\pm 1$ standard error.
  • ...and 5 more figures