Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

Wenshuo Peng, Kaipeng Zhang, Yue Yang, Hao Zhang, Yu Qiao

TL;DR

This work tackles the problem that weakly correlated image-text pairs in large pre-training datasets prevent vision-language foundation models from fully exploiting available knowledge for downstream image classification. It introduces Data Adaptive Traceback (DAT), a three-module adaptation framework comprising a zero-shot data sampling step to curate a downstream-related pre-training bank, a semi-supervised step with pseudo-labeling to reuse pre-training data, and a semi-unified vision-language contrastive module to mitigate confirmation bias. Across eight benchmarks, DAT consistently improves over standard fine-tuning and semi-supervised baselines, with larger gains observed when using bigger pre-training datasets and models. DAT demonstrates that reusing pre-training data during adaptation can unlock previously neglected downstream knowledge and generalizes to various external image-text datasets and adaptation setups.

Abstract

Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, because curating pre-training datasets is costly, these datasets contain large numbers of pairs with weak image-text correlation. We call them weak-paired samples. Because of these weak-paired samples, the pre-trained model is unable to mine all the knowledge in the pre-training data. Existing adaptation methods do not consider this missing knowledge, so crucial task-related knowledge for the downstream tasks may be ignored. To address this issue, we propose a new adaptation framework called Data Adaptive Traceback (DAT). Specifically, we use a zero-shot-based method to extract the subset of the pre-training data most related to the downstream task. Furthermore, we adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning. We conduct extensive experiments that show our proposed DAT approach meaningfully improves performance on various benchmark datasets over traditional adaptation methods.
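
The zero-shot-based extraction step described above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes image and class-prompt embeddings are already available as plain vectors, and the names `zero_shot_sample`, `per_class`, and `temperature` (and its default value) are hypothetical choices made for illustration. Each pre-training image is assigned its most similar downstream class, and only the most confident images per class are kept as the pre-training bank.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_sample(image_embs, class_text_embs, per_class=1, temperature=0.1):
    """Map each class index to the indices of its most confident images.

    Assumption: the embeddings come from a pre-trained vision-language
    model; here they are just lists of floats.
    """
    scored = []
    for idx, img in enumerate(image_embs):
        sims = [cosine(img, txt) for txt in class_text_embs]
        probs = softmax(sims, temperature)
        best = max(range(len(probs)), key=probs.__getitem__)
        scored.append((idx, best, probs[best]))

    bank = {}
    for cls in range(len(class_text_embs)):
        members = [s for s in scored if s[1] == cls]
        members.sort(key=lambda s: s[2], reverse=True)  # most confident first
        bank[cls] = [idx for idx, _, _ in members[:per_class]]
    return bank


# Toy data: two downstream classes, four pre-training images.
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9], [0.5, 0.6]]
bank = zero_shot_sample(image_embs, class_text_embs, per_class=1)
print(bank)
```

In this toy run, images 0 and 2 are kept because they are the most confident matches for classes 0 and 1 respectively. The paper's data sampling module uses two consecutive samplings (Figure 2), which this single-pass sketch does not reproduce.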

Paper Structure

This paper contains 23 sections, 7 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: (a) An illustration of our motivation. The weak pre-training image descriptions lead the model to ignore some knowledge (e.g., dog and cat in the figure), which may be related to a downstream task. (b) We propose Data Adaptive Traceback (DAT) to retrieve downstream-most-related pre-training data efficiently through the data sampling module and learn them effectively through the semi-supervised module and the semi-unified contrastive module. Please see Figures 2, 3, and 4 and the corresponding sections for more details.
  • Figure 2: The pipeline of the data sampling module. Through two consecutive samplings, we draw a subset from the large-scale pre-training data that resembles the distribution and classes of the downstream images.
  • Figure 3: The pipeline of the semi-supervised module. With two data augmentation approaches, we can provide pseudo-labels to the pre-training bank images so that the pre-training bank images can be trained in a conventional supervised learning manner.
  • Figure 4: Explanation of the semi-unified contrastive module. Building on the semi-supervised module, we use prompt engineering to map the downstream images and the pre-training bank images into a common image-text-label data space. We cluster the pre-training bank images and downstream images by adding an image-text contrastive loss during adaptation to address the confirmation bias problem.
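
The pseudo-labeling idea in the Figure 3 caption can be sketched as follows. The caption states only that two data augmentation approaches are used to give pseudo-labels to pre-training bank images; the weak/strong split, the confidence threshold, and the function name `pseudo_label_loss` are assumptions borrowed from common FixMatch-style semi-supervised training, not details confirmed by this page. The sketch takes class probabilities predicted on two augmented views of each bank image, derives a pseudo-label from the first view when it is confident enough, and applies a cross-entropy loss to the second view.

```python
import math


def pseudo_label_loss(weak_probs, strong_probs, threshold=0.95):
    """Return (mean loss, pseudo-labels) for a batch of bank images.

    weak_probs / strong_probs: per-image class probabilities predicted on
    two differently augmented views (assumed weak and strong views).
    Images whose weak-view confidence is below `threshold` get the label
    None and contribute zero loss.
    """
    total = 0.0
    labels = []
    for weak, strong in zip(weak_probs, strong_probs):
        label = max(range(len(weak)), key=weak.__getitem__)
        if weak[label] >= threshold:
            labels.append(label)
            # Cross-entropy of the second view against the pseudo-label;
            # the floor avoids log(0).
            total += -math.log(max(strong[label], 1e-12))
        else:
            labels.append(None)
    return total / len(weak_probs), labels


# Toy batch: the first image is confident, the second is not.
weak_probs = [[0.97, 0.03], [0.60, 0.40]]
strong_probs = [[0.80, 0.20], [0.50, 0.50]]
loss, labels = pseudo_label_loss(weak_probs, strong_probs)
print(labels, round(loss, 4))
```

Only the first image receives a pseudo-label here, so the loss is the cross-entropy of its second view averaged over the batch of two. Thresholding alone does not remove confirmation bias from wrong but confident pseudo-labels, which is the problem the semi-unified contrastive module in Figure 4 is meant to address.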