ActiveDC: Distribution Calibration for Active Finetuning

Wenshuai Xu, Zhenghui Hu, Yu Lu, Jinzhou Meng, Qingjie Liu, Yunhong Wang

TL;DR

ActiveDC addresses distribution bias in active finetuning in two stages. A data-selection stage chooses a subset whose distribution matches that of the full unlabeled pool, and a distribution-calibration stage exploits implicit category information in the pretrained features. The calibration stage generates pseudo-features from calibrated statistics and adds the real samples closest to them to the labeled pool. The method is reported to achieve state-of-the-art performance on CIFAR10, CIFAR100, and ImageNet-1k, especially at very low sampling budgets. Its key components are the two-stage framework, a Tukey-based distribution transformation, pseudo-category statistics, and a filtering step based on Earth Mover's Distance. The approach is relevant to cost-effective finetuning of vision models when labeling resources are scarce.

Abstract

The pretraining-finetuning paradigm has gained popularity in various computer vision tasks. Within this paradigm, active finetuning has emerged in response to the abundance of large-scale data and the high cost of annotation. Active finetuning selects a subset of data from an unlabeled pool for annotation, and the annotated subset is then used for finetuning. However, a limited number of training samples can yield a biased distribution, which may cause the model to overfit. In this paper, we propose a new method called ActiveDC for active finetuning tasks. First, we select samples for annotation by optimizing the distribution similarity between the subset to be selected and the entire unlabeled pool in continuous space. Second, we calibrate the distribution of the selected samples by exploiting implicit category information in the unlabeled pool. Feature visualization gives an intuitive sense of the effectiveness of our approach to distribution calibration. We conducted extensive experiments on three image classification datasets with different sampling ratios. The results indicate that ActiveDC consistently outperforms the baselines in all image classification tasks. The improvement is particularly significant when the sampling ratio is low, with performance gains of up to 10%. Our code will be released.

Paper Structure (12 sections, 10 equations, 5 figures, 6 tables, 1 algorithm)

This paper contains 12 sections, 10 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of the performance for finetuning the CIFAR100 dataset at different sampling ratios.
  • Figure 2: The active finetuning task involves the active selection of training data for finetuning within the pretraining-finetuning paradigm. We focus on data selection and distribution calibration from a large unlabeled data pool for annotation. The Distribution Calibration Module comprises four main steps: (1) applying Tukey's Ladder of Powers Transformation to render the feature distribution more Gaussian-like, (2) clustering the features and calibrating the statistics for different feature classes, (3) generating pseudo-features using the calibrated statistics and identifying the most similar real features, and (4) filtering and integrating the features into the extended labeled pool.
  • Figure 3: The diagram showcases two contrasting scenarios in the finetuning of a pretrained model. On the left, finetuning with a limited number of sample features leads to model overfitting. On the right, employing features sampled from a calibration distribution for finetuning the pretrained model demonstrates improved generalization.
  • Figure 4: t-SNE embeddings of CIFAR10: We visualize the embeddings of selected samples labeled by the oracle (shown as pentagrams) and of distribution-calibration samples obtained via ActiveDC (shown as triangles) at a sampling ratio of $0.1\%$. Best viewed in color.
  • Figure 5: The effect of $\lambda$: The upper curve (in red) represents finetuning accuracy with statistical calibration, while the lower curve (in blue) represents finetuning accuracy without statistical calibration. These results are obtained using different values of the hyperparameter $\lambda$ in the Tukey's Ladder of Powers transformation.
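
The calibration steps listed in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard form of Tukey's Ladder of Powers ($x^\lambda$ for $\lambda \neq 0$, $\log x$ for $\lambda = 0$) applied to non-negative features, uses a plain per-cluster mean and diagonal variance as a stand-in for the calibrated statistics, and omits the clustering and Earth Mover's Distance filtering steps. All function names and the toy data are hypothetical.

```python
import math
import random


def tukey_transform(features, lam=0.5):
    """Tukey's Ladder of Powers: x**lam if lam != 0, else log(x).

    Assumes non-negative feature values (strictly positive when lam == 0).
    """
    if lam == 0:
        return [[math.log(x) for x in row] for row in features]
    return [[x ** lam for x in row] for row in features]


def mean_and_variance(rows):
    """Per-dimension mean and variance of a group of feature vectors."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dim)]
    var = [sum((r[d] - mean[d]) ** 2 for r in rows) / n for d in range(dim)]
    return mean, var


def sample_pseudo_features(mean, var, count, rng):
    """Draw pseudo-features from a diagonal Gaussian (an assumption of this sketch)."""
    return [
        [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean, var)]
        for _ in range(count)
    ]


def nearest_real_index(pseudo, real_features):
    """Index of the real feature closest to a pseudo-feature (Euclidean distance)."""
    return min(
        range(len(real_features)),
        key=lambda i: math.dist(pseudo, real_features[i]),
    )


# Toy data: 50 random non-negative 4-dimensional "pretrained features".
rng = random.Random(0)
pool = [[rng.random() for _ in range(4)] for _ in range(50)]

transformed = tukey_transform(pool, lam=0.5)
# Hypothetical stand-in for one pseudo-category: the first 10 pool features.
mean, var = mean_and_variance(transformed[:10])
pseudo = sample_pseudo_features(mean, var, count=5, rng=rng)
# Real samples that would be candidates for the extended labeled pool.
matches = [nearest_real_index(p, transformed) for p in pseudo]
```

In this sketch `matches` holds indices into the unlabeled pool. The described pipeline would still filter such candidates before adding them to the extended labeled pool.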