Unsupervised Active Learning via Natural Feature Progressive Framework

Yuxi Liu, Catherine Lalman, Yimin Yang

TL;DR

The paper tackles the data labeling bottleneck in deep learning by proposing Unsupervised Active Learning via the Natural Feature Progressive Framework (NFPF). It combines a Reconstruction Difference–based seed initialization with a lightweight Specific Feature Learning Machine to measure sample learnability through inter-model discrepancy, enabling a progressive, one-shot subset selection without backpropagation across rounds. Empirical results across nine datasets show NFPF outperforms existing UAL methods and approaches supervised AL on vision tasks, with substantial reductions in labeling and training steps (notably 7x–20x fewer steps on CIFAR-100). Ablation studies and visualizations corroborate robustness, distribution coverage, and informative sampling, highlighting NFPF’s practical impact for cost-efficient large-scale learning.

Abstract

The effectiveness of modern deep learning models is predicated on the availability of large-scale, human-annotated datasets, a process that is notoriously expensive and time-consuming. While Active Learning (AL) offers a strategic solution by labeling only the most informative and representative data, its iterative nature still necessitates significant human involvement. Unsupervised Active Learning (UAL) presents an alternative by shifting the annotation burden to a single, post-selection step. Unfortunately, prevailing UAL methods struggle to achieve state-of-the-art performance. These approaches typically rely on local, gradient-based scoring for sample importance estimation, which not only makes them vulnerable to ambiguous and noisy data but also hinders their capacity to select samples that adequately represent the full data distribution. Moreover, their use of shallow, one-shot linear selection falls short of a true UAL paradigm. In this paper, we propose the Natural Feature Progressive Framework (NFPF), a UAL method that revolutionizes how sample importance is measured. At its core, NFPF employs a Specific Feature Learning Machine (SFLM) to effectively quantify each sample's contribution to model performance. We further utilize the SFLM to define a powerful Reconstruction Difference metric for initial sample selection. Our comprehensive experiments show that NFPF significantly outperforms all established UAL methods and achieves performance on par with supervised AL methods on vision datasets. Detailed ablation studies and qualitative visualizations provide compelling evidence for NFPF's superior performance, enhanced robustness, and improved data distribution coverage.

Paper Structure

This paper contains 19 sections, 13 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Fast convergence on the large-scale classification dataset CIFAR-100. The Natural Feature Progressive Framework (NFPF) requires fewer gradient steps to train the classifier compared with standard uniform data selection. PSS-AL denotes the Unsupervised Projected Sample Selector (pi2024unsupervised), and DUAL denotes On Deep Unsupervised Active Learning (ijcai2020p364).
  • Figure 2: Active Learning and Unsupervised Active Learning (UAL) schemes. UAL aims to select an informative and representative subset without ground-truth labels, minimizing human annotation. (a) Active Learning process, (b) classical UAL methods, (c) our NFPF scheme. Our proposed NFPF scheme follows the same procedure as Active Learning, except that it omits the iterative human annotation step.
  • Figure 3: The overall architecture of the proposed NFPF, showing its two main phases: a subset initialization phase and the UAL learning phase. Dashed arrows indicate the training process, while solid lines denote the direction of data flow. (a) Initialization of the subset $\mathbf{X}_S^0$ by the Reconstruction Difference (RD) algorithm: in this phase, we first train the Specific Feature Learning Machine (SFLM) module on a core set of $C$. The trained module is then applied to all unlabeled data to compute their RD scores, enabling the ranking and selection of the initial subset. (b) NFPF scheme: the reference SFLM model is first trained on all data, without labels. The current SFLM model is trained on $\mathbf{X}_S^0$ from (a). In each learning cycle: ➀ calculate the reconstruction loss for both the reference and the current model; ➁ rank the scores and select the top-$n$ samples into the candidate pool $\mathbf{X}_S^t$; ➂ train the current model for the next cycle. This process is iterated $t$ times until the subset reaches the target size $m$. The accumulated selected samples are subsequently forwarded to the oracle for annotation and used for downstream classification tasks.
  • Figure 4: Procedure of subset initialization: (a) Ground Truth; (b) Three core categories (red, blue, green) are selected to train the SFLM model; (c) $\mathbf{X}_S^0$ results ($k=10$). Some samples that lie close to two cores are highlighted in magenta, cyan, or teal, while gray points denote remaining data.
  • Figure 5: The idea of cluster growing with reconstructed samples. The key is to design a fast but "weak" autoencoder.
  • ...and 7 more figures
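
The learning cycle described in the Figure 3(b) caption (score with a reference and a current model, take the top-$n$, retrain the current model) can be illustrated with a minimal sketch. Several parts are assumptions rather than the authors' method: a closed-form linear autoencoder (PCA) stands in for the SFLM, the score is taken to be the current model's reconstruction loss minus the reference model's, and the names `fit_linear_autoencoder`, `reconstruction_loss`, and `progressive_select` are hypothetical. The initial subset is drawn at random here as a placeholder for the Reconstruction Difference initialization, whose exact ranking rule is not given in this section.

```python
import numpy as np


def fit_linear_autoencoder(X, rank):
    """Fit a rank-limited linear autoencoder (PCA) in closed form, without backprop.

    Assumption: this is only a lightweight stand-in for the paper's SFLM.
    """
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:rank]


def reconstruction_loss(model, X):
    """Per-sample squared reconstruction error under a fitted model."""
    mean, components = model
    centered = X - mean
    recon = centered @ components.T @ components
    return ((centered - recon) ** 2).sum(axis=1)


def progressive_select(X, init_idx, n_per_cycle, target_size, rank=2):
    """Grow a subset of row indices of X until it holds target_size samples."""
    assert target_size <= len(X), "cannot select more samples than exist"
    selected = list(init_idx)
    # Reference model: trained once on all unlabeled data.
    reference = fit_linear_autoencoder(X, rank)
    ref_loss = reconstruction_loss(reference, X)
    while len(selected) < target_size:
        # Current model: retrained each cycle on the subset selected so far.
        current = fit_linear_autoencoder(X[selected], rank)
        # Assumed score: excess loss of the current model over the reference.
        score = reconstruction_loss(current, X) - ref_loss
        score[selected] = -np.inf  # never pick an already-selected sample
        n = min(n_per_cycle, target_size - len(selected))
        top = np.argsort(score)[::-1][:n]  # indices of the n highest scores
        selected.extend(top.tolist())
    return np.array(selected)


# Toy usage on synthetic data; a random seed subset replaces the RD initialization.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
init_idx = rng.choice(len(X), size=10, replace=False)
subset = progressive_select(X, init_idx, n_per_cycle=5, target_size=30)
print(subset.shape)  # (30,)
```

Labels are requested only once, for the rows in `subset`, after the loop finishes; this mirrors the single post-selection annotation step that distinguishes UAL from iterative active learning.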