CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts

Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi

TL;DR

This work addresses the tendency of CLIP-like vision-language models to entangle content and style, which harms generalization under distribution shifts and prompt variation. It introduces a causal perspective and two augmentation-based strategies, Im.Aug (image augmentation) and CLAP (augmented prompts), to disentangle content from style and improve representations; CLAP trains on the pre-trained CLIP text encoder and transfers the gains to the image pathway. Across four multi-domain datasets, CLAP delivers consistent zero-shot and few-shot improvements and greater robustness to adversarial perturbations, outperforming both CLIP and Im.Aug. The paper also provides ablation and visualization analyses. The findings indicate that content-focused representations can be learned more effectively via text-based augmentation, suggesting promising directions for cross-modal augmentation and more robust multimodal learning systems.

Abstract

Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly because their learned features generalize remarkably well. However, the features they learn often blend content and style information, which somewhat limits their generalization under distribution shifts. To address this limitation, we adopt a causal generative perspective for multimodal data and propose contrastive learning with data augmentation to disentangle content features from the original representations. To achieve this, we begin by exploring image augmentation techniques and develop a method to seamlessly integrate them into pre-trained CLIP-like models to extract pure content features. Taking a step further, recognizing the inherent semantic richness and logical structure of text data, we explore the use of text augmentation to isolate latent content from style features. This enables the encoders of CLIP-like models to concentrate on latent content information, refining the representations learned by pre-trained CLIP-like models. Our extensive experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks, alongside enhanced robustness to various perturbations. These results underscore the effectiveness of our proposed methods in refining vision-language representations and advancing the state of the art in multimodal learning.
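
As a rough illustration of the contrastive objective described in the abstract, the sketch below computes a standard InfoNCE-style loss between features of original samples and features of their augmented views. It is not taken from the paper: the function names, the temperature value, and the toy two-dimensional features are all assumptions, and the paper's exact loss may differ. Plain Python is used so the example runs without dependencies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the augmented view of anchors[i].

    Every other row of `positives` acts as a negative for anchors[i].
    The temperature value is an assumption chosen for illustration.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


# Toy features (hypothetical): the first coordinate plays the role of content,
# the second the role of style, and augmentation perturbs only the style.
original = [[1.0, 0.2], [-1.0, 0.1]]
augmented = [[1.0, -0.3], [-1.0, 0.4]]
shuffled = [augmented[1], augmented[0]]  # deliberately mismatched pairs

loss_matched = info_nce(original, augmented)
loss_mismatched = info_nce(original, shuffled)
```

In this toy setup the loss is low when each sample is paired with its own style-perturbed view and high when the pairs are mismatched, which is the pressure that pushes a learned feature toward the content that survives augmentation.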

Paper Structure (47 sections, 4 equations, 7 figures, 19 tables)

This paper contains 47 sections, 4 equations, 7 figures, 19 tables.

Figures (7)

  • Figure 1: Causal generative models of vision-language data. Image and text data are generated through distinct underlying deterministic processes, $\mathbf{g_x}$ for images and $\mathbf{g_t}$ for texts, derived from a unified latent space with latent content variables $\mathbf{c}$ and latent style variables $\mathbf{s}$. Latent content $\mathbf{c}$ exclusively determines the sample label $\mathbf{y}$. (a) Soft interventions on latent style variables $\mathbf{s}$ result in $\mathbf{\tilde{s}}$, subsequently generating augmented images $\mathbf{\tilde{x}}$. (b) Due to the same latent space, soft interventions on latent style variables $\mathbf{s}$ can also result in $\mathbf{\tilde{s}}$, producing augmented text $\mathbf{\tilde{t}}$. (c) A qualitative comparison of image features for zero-shot classification using "a photo of a [class]" prompts, visualized using class activation maps (CAMs) [MamoolerCLIPExplain], demonstrates that while image augmentation can enhance CLIP features, the features obtained through text augmentation methods predominantly focus on the content.
  • Figure 1: Examples of synthetic images created with SDv2.1 and associated prompts.
  • Figure 2: Refining CLIP through data augmentation. (a) Training involves a disentangled network $\mathbf{f_c}$, utilizing contrastive loss on original and augmented image pairs $\mathbf{x}$ and $\mathbf{\tilde{x}}$, with CLIP's image encoder $\mathbf{f^*_x}$ holding frozen gradients. (b) More efficient content feature learning is achieved through contrastive learning with augmented text prompts $\mathbf{t}$ and $\mathbf{\tilde{t}}$, using the fixed text encoder $\mathbf{f^*_t}$ of CLIP. (c) Inference stage: The trained disentangled network $\mathbf{f^*_c}$ integrates with CLIP's text and image encoders, $\mathbf{f^*_t}$ and $\mathbf{f^*_x}$, to enable zero-shot inference for an input image $\mathbf{x}$ and class names $\mathbf{t}_1$ to $\mathbf{t}_n$.
  • Figure 3: Structure of the disentangled network. The architecture comprises a residual block featuring a zero-initialized, bias-free linear layer, so that optimization starts from the input feature space. When the input and output dimensions differ, a downsampling operation is used to align them. During inference, a scalar parameter $\alpha$ balances the main branch and the input features before they are combined.
  • Figure 4: Few-shot linear probe comparisons of image-encoder features show that CLAP enhances CLIP's few-shot performance more effectively than Im.Aug. In the accompanying figure, "ZS" indicates the zero-shot performance using a "[class]" prompt.
  • ...and 2 more figures
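
The residual structure in the Figure 3 caption and the inference stage in Figure 2(c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names, the single hidden layer with ReLU, the way $\alpha$ scales the main branch, and the toy feature vectors are all hypothetical. Input and output dimensions are assumed equal, so the downsampling step is omitted. The one detail taken from the caption is the zero-initialized, bias-free output layer, which makes the block an identity map before training.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


class ResidualContentBlock:
    """Hypothetical stand-in for the disentangled network f_c."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Assumed hidden layer: small random weights, zero bias.
        self.w_in = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b_in = [0.0] * hidden
        # Zero-initialized, bias-free output layer (as in the Figure 3 caption).
        self.w_out = [[0.0] * hidden for _ in range(dim)]

    def main_branch(self, x):
        h = [max(0.0, z + b) for z, b in zip(matvec(self.w_in, x), self.b_in)]
        return matvec(self.w_out, h)

    def forward(self, x, alpha=1.0):
        # Assumed combination rule: input features plus alpha times the main branch.
        return [xi + alpha * mi for xi, mi in zip(x, self.main_branch(x))]


def zero_shot_predict(block, image_feat, class_text_feats, alpha=1.0):
    """Return the index of the class whose refined text feature is most similar."""
    z = block.forward(image_feat, alpha)
    scores = [cosine(z, block.forward(t, alpha)) for t in class_text_feats]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy features standing in for frozen CLIP encoder outputs (hypothetical values).
block = ResidualContentBlock(dim=3, hidden=4)
image_feat = [0.9, 0.1, 0.3]
class_text_feats = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.2]]

identity_at_init = block.forward(image_feat) == image_feat
predicted = zero_shot_predict(block, image_feat, class_text_feats)
```

Because the output layer starts at zero, an untrained block returns its input unchanged, so the first predictions match those of the frozen encoders; training then only has to learn a correction on top of the existing feature space.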