VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness

Rongyu Zhang, Zefan Cai, Huanrui Yang, Zidong Liu, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Baobao Chang, Yuan Du, Li Du, Shanghang Zhang

TL;DR

VeCAF tackles the inefficiency of finetuning pretrained vision models by jointly selecting informative data with objective-awareness and enriching image representations through language-guided cues. It formalizes an Objective-aware Data Selection (ODS) mechanism that minimises $D_{KL}(p_{\mathcal{L}}||p_S) - \lambda R(p_S)$ using an ensemble of centroid-based selectors, and couples this with Cross-attentive Embedding Augmentation (CEA) that fuses caption-derived text embeddings into image features via $e^{aug}_i = e_i - \eta \alpha_i (e_i - t_i)$. The approach yields substantial efficiency gains (e.g., up to $3.3\times$ fewer training batches on ImageNet) and improves accuracy against strong baselines, with robust performance on out-of-distribution data through target-domain caption augmentation. The results demonstrate VeCAF’s versatility across diverse PVMs and language encoders, confirming the practical value of integrating training-objective awareness and language-grounded signals into finetuning workflows.
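The CEA update quoted above, $e^{aug}_i = e_i - \eta \alpha_i (e_i - t_i)$, is algebraically a linear interpolation between the image embedding and the text embedding: $e^{aug}_i = (1 - \eta \alpha_i)\, e_i + \eta \alpha_i\, t_i$. The following is a minimal sketch of that single step, not the authors' implementation. It assumes that $\alpha_i$ is a scalar weight already produced by the cross-attention step (its computation is not specified in this summary) and that $\eta$ is a scalar step size; the function name `cea_augment` and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def cea_augment(
    e: Sequence[float],
    t: Sequence[float],
    alpha: float,
    eta: float,
) -> List[float]:
    """Apply e_aug = e - eta * alpha * (e - t) element-wise.

    e     : image embedding (assumed to share dimensionality with t)
    t     : caption-derived text embedding
    alpha : attention weight for this sample (assumed to be a given scalar)
    eta   : augmentation strength (assumed to be a scalar hyperparameter)
    """
    if len(e) != len(t):
        raise ValueError("image and text embeddings must have the same length")
    step = eta * alpha
    # Move each coordinate of e toward t by the fraction eta * alpha.
    return [ei - step * (ei - ti) for ei, ti in zip(e, t)]


# Toy vectors, chosen only for illustration.
image_emb = [1.0, 0.0, 2.0]
text_emb = [0.0, 1.0, 2.0]

unchanged = cea_augment(image_emb, text_emb, alpha=0.0, eta=0.5)
halfway = cea_augment(image_emb, text_emb, alpha=1.0, eta=0.5)
replaced = cea_augment(image_emb, text_emb, alpha=1.0, eta=1.0)
```

Under these assumptions, a product $\eta \alpha_i = 0$ leaves the image embedding untouched, while $\eta \alpha_i = 1$ replaces it with the text embedding; intermediate values blend the two.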

Abstract

Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. However, the conventional finetuning process with randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). With the emerging availability of labels and natural language annotations of images through web-scale crawling or controlled generation, VeCAF uses this information to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence to meet the performance goal. This process is assisted by the inherent semantic richness of the text embedding space, which we use to augment image features. Furthermore, the flexibility of text-domain augmentation allows VeCAF to handle out-of-distribution scenarios without external data. Extensive experiments show that VeCAF delivers leading performance and high computational efficiency, outperforming baselines in both in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning, and achieves an accuracy improvement of 2.7% over the state-of-the-art active finetuning method with the same number of batches.

Paper Structure (29 sections, 9 equations, 7 figures, 9 tables, 1 algorithm)


Figures (7)

  • Figure 1: (a) Motivation of VeCAF. We select the optimal subset from a large labeled training set for efficient finetuning (FT) towards a user-specified objective. (b) Training curve comparison on ImageNet-1K validation set. All baselines select 1% of data in each FT loop with the exception of a conventional setup with full-data FT. VeCAF achieves the target accuracy faster with significantly fewer training batches and achieves higher accuracy with the same training cost.
  • Figure 2: The overall framework of VeCAF. In each data selection loop, VeCAF performs an Objective-aware Data Selection (ODS) to select more informative images for finetuning. Cross-attentive Embedding Augmentation (CEA) is performed on the selected images to further enrich the semantic information captured by the image embeddings by incorporating language knowledge of the caption.
  • Figure 3: Samples selected by ActiveFT (Xie et al., 2023) and VeCAF. With the augmented caption "It is a $\{snowy/rainy\}$ day!", VeCAF can select images that correspond to the target domain.
  • Figure 4: Comparison of training efficiency. VeCAF requires significantly fewer training batches to reach the target accuracy (B2A) compared with other baselines and full-data finetuning. Note that the y-axis has an exponential scale.
  • Figure 5: Training loss curves of VeCAF and the baselines (ActiveFT, ALFA-Mix, and full-data FT) on Caltech-101 (left) and ImageNet-1K (right), using 5% and 1% of the data, respectively.
  • ...and 2 more figures