Debiasing Vision-Language Models with Text-Only Training

Yunfan Yang, Chaoquan Jiang, Zhiyu Lin, Jinlin Xiao, Jiaming Zhang, Jitao Sang

TL;DR

The paper proposes TOD, a Text-Only Debiasing framework that uses a text-as-image training paradigm to mitigate visual biases. TOD significantly improves group robustness, achieving state-of-the-art results among image-free methods and performance competitive with image-supervised methods.

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have exhibited remarkable performance across various downstream tasks by aligning text and images in a unified embedding space. However, because its pre-training data is imbalanced, CLIP exhibits bias in real-world applications. Existing debiasing methods struggle to obtain sufficient image samples for minority groups and incur high costs for group labeling. To address these limitations, we propose a Text-Only Debiasing framework called TOD, which leverages a text-as-image training paradigm to mitigate visual biases. Specifically, the approach repurposes the text encoder to function as an image encoder, thereby eliminating the need for image data. It also uses a large language model (LLM) to generate a balanced text dataset, which is then used for prompt tuning. However, we observed that the model overfits to the text modality because label names, which serve as supervision signals, appear explicitly in the texts. To address this issue, we further introduce a Multi-Target Prediction (MTP) task that encourages the model to focus on complex contexts and to distinguish between target and bias information. Extensive experiments on the Waterbirds and CelebA datasets show that our method significantly improves group robustness, achieving state-of-the-art results among image-free methods and performance competitive with image-supervised methods. Furthermore, the proposed method can be adapted to challenging scenarios with multiple or unknown bias attributes, demonstrating its strong generalization and robustness.
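To make the multi-target idea concrete, the following is a minimal sketch of prediction over (target, bias) groups, following the rule stated in the Figure 1 caption that the target attribute of the highest-scoring group is the final prediction. It is an illustration under stated assumptions, not the authors' implementation: the hand-written 4-dimensional vectors stand in for CLIP embeddings, cosine similarity stands in for the logits, and the names `cosine`, `predict_target`, `toy_prompts`, and `toy_input` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_target(embedding, group_prompts):
    """Score every (target, bias) group and return (target, best_group).

    group_prompts maps a (target, bias) tuple to a prompt embedding.
    Assumption: the logit of a group is the cosine similarity between the
    input embedding and that group's prompt embedding.
    """
    best_group = max(group_prompts, key=lambda g: cosine(embedding, group_prompts[g]))
    return best_group[0], best_group


# Toy prompt embeddings (assumed values): the first two dimensions encode the
# bird type, the last two encode the background.
toy_prompts = {
    ("landbird", "land background"): [1.0, 0.0, 1.0, 0.0],
    ("landbird", "water background"): [1.0, 0.0, 0.0, 1.0],
    ("waterbird", "land background"): [0.0, 1.0, 1.0, 0.0],
    ("waterbird", "water background"): [0.0, 1.0, 0.0, 1.0],
}

# A minority-group input: a waterbird photographed on land.
toy_input = [0.1, 0.9, 0.8, 0.2]

target, group = predict_target(toy_input, toy_prompts)
print(target, group)
```

Because each group has its own prompt, the background evidence is absorbed by the bias half of the group label rather than being forced to vote for a bird class, which is one way to read the abstract's claim that the task helps separate target from bias information.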

Paper Structure

This paper contains 38 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of the Text-Only Debiasing (TOD) framework. (a) Construction of a balanced text dataset using GPT-4o. First, we generate a set of text descriptions for the target attributes and for the bias attributes separately. Then, we randomly sample from these sets and concatenate the samples to create text descriptions with group labels. Both training and inference are based on multi-target prediction, which simultaneously predicts target attributes and bias attributes. (b) During training, we use two identical, frozen text encoders from pre-trained CLIP that separately encode the text descriptions and the class prompts. The model is optimized through prompt tuning. (c) During inference, we replace the text descriptions with images as input and take the target attribute of the group with the highest logit as the final prediction.
  • Figure 2: Loss curves on the Waterbirds and CelebA datasets. The orange and blue lines represent single-target and multi-target training, respectively. Triangle marks denote training loss, and circle marks denote testing loss. We present normalized loss curves to remove the scale differences between losses under different prediction targets.
  • Figure 3: Image data sensitivity analysis.
  • Figure 4: Grad-CAM (Selvaraju et al., 2017) visualizations comparing zero-shot (ZS) CLIP and TOD. We obtain the embedding of the label or bias attribute name in the text; the highlighted areas indicate the attention of that token embedding to the image.
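The dataset construction described in Figure 1(a) can be sketched as follows. This is a hedged illustration only: the short description pools below are hand-written placeholders standing in for GPT-4o output, and the function name `build_balanced_texts`, the `per_group` parameter, and the space-joined concatenation are assumptions rather than details taken from the paper.

```python
import random
from collections import Counter
from itertools import product


def build_balanced_texts(target_texts, bias_texts, per_group, seed=0):
    """Sample and concatenate descriptions so every group has equal size.

    target_texts / bias_texts map an attribute name to a pool of descriptions.
    Each (target, bias) pair is one group and receives exactly `per_group`
    concatenated texts, which balances the dataset by construction.
    """
    rng = random.Random(seed)
    dataset = []
    for target, bias in product(target_texts, bias_texts):
        for _ in range(per_group):
            target_desc = rng.choice(target_texts[target])
            bias_desc = rng.choice(bias_texts[bias])
            dataset.append(
                {"text": f"{target_desc} {bias_desc}", "target": target, "bias": bias}
            )
    rng.shuffle(dataset)
    return dataset


# Placeholder pools (assumed, not generated by an LLM).
target_pool = {
    "landbird": ["A sparrow with brown streaked wings.", "A small woodpecker."],
    "waterbird": ["A gull with grey wings.", "A duck with webbed feet."],
}
bias_pool = {
    "land background": ["It stands in a bamboo forest.", "Behind it is a dry field."],
    "water background": ["It is near the open ocean.", "A lake fills the scene."],
}

balanced = build_balanced_texts(target_pool, bias_pool, per_group=5)
group_sizes = Counter((row["target"], row["bias"]) for row in balanced)
print(group_sizes)
```

Sampling the two halves independently means minority combinations, such as a waterbird on land, are as frequent as majority ones, which image datasets rarely offer.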