Enhance Vision-Language Alignment with Noise

Sida Huang, Hongyuan Zhang, Xuelong Li

TL;DR

PiNI introduces a novel noise-based fine-tuning framework for CLIP that learns a beneficial noise distribution (pi-noise) and injects it into both visual and textual encoders to improve vision–language alignment under few-shot constraints. By reformulating CLIP inference to treat prompts as a stochastic variable and applying variational inference, PiNI derives a tractable objective that guides noise generation conditioned on prompts. Empirical results across 11 datasets show PiNI outperforms zero-shot CLIP and several PEFT baselines, with pronounced gains in very low-shot regimes and robust domain generalization. The work highlights a new direction in VL fine-tuning that leverages learned noise to diversify embeddings and reduce dataset bias, with potential extensions to VQA, detection, and generation tasks.

Abstract

With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Unlike existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise or $\pi$-noise), which quantitatively analyzes the impact of noise. It therefore implies a new scheme for learning a beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate $\pi$-noise for the visual and linguistic modalities. We then propose the Positive-incentive Noise Injector (PiNI), which fine-tunes CLIP by injecting noise into both the visual and text encoders. Since the proposed method learns the distribution of beneficial noise, we can obtain more diverse vision and language embeddings that better align the two modalities for specific downstream tasks within limited computational resources. We evaluate different noise incorporation approaches and network architectures of PiNI. The evaluation across 11 datasets demonstrates its effectiveness.
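As a rough illustration of the idea described in the abstract, the sketch below draws noise from a learned Gaussian with the reparameterization trick, adds it to a frozen image feature, and averages CLIP-style cosine-similarity predictions over several samples. The function names, the diagonal-Gaussian form, the additive injection at the feature level, and the toy feature vectors are all assumptions made for illustration; this is not the authors' implementation, and in PiNI the distribution parameters would come from a trained noise generator rather than being fixed by hand.

```python
import math
import random

def sample_noise(mu, sigma, rng):
    """Reparameterized sample: z ~ N(0, 1) per dimension, noise = mu + sigma * z."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_logits(image_feat, text_feats, scale=100.0):
    """CLIP-style logits: scaled cosine similarity between one image and each class text."""
    img = l2_normalize(image_feat)
    return [scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
            for t in text_feats]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_predict(image_feat, text_feats, mu, sigma, n_samples=8, seed=0):
    """Average class probabilities over several noise samples added to the image feature.

    mu and sigma are hypothetical stand-ins for the output of a learned noise generator.
    """
    rng = random.Random(seed)
    avg = [0.0] * len(text_feats)
    for _ in range(n_samples):
        noise = sample_noise(mu, sigma, rng)
        noisy_feat = [x + n for x, n in zip(image_feat, noise)]
        probs = softmax(cosine_logits(noisy_feat, text_feats))
        avg = [a + p / n_samples for a, p in zip(avg, probs)]
    return avg

# Toy example: one 3-dimensional image feature and two class text features.
image_feat = [1.0, 0.0, 0.0]
text_feats = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
probs = noisy_predict(image_feat, text_feats, mu=[0.0] * 3, sigma=[0.05] * 3)
```

Setting `sigma` to zero and `mu` to zero recovers the deterministic zero-shot prediction, which makes the noise-free baseline a special case of the same code path.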

Paper Structure

This paper contains 32 sections, 14 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Different strategies for constructing prompts. A single image can be described by multiple prompts. The first and second strategies each correspond to one prompt per image. After fine-tuning, the learnable prompt becomes closer to the image in the embedding space. After learning the noise distribution in prompts, we can easily sample prompts with richer and more precise semantics as needed.
  • Figure 2: The probabilistic graphical model of PiNI.
  • Figure 3: (a) The inference process of CLIP, whose output features are utilized in PiNI. (b) The framework of PiNI, which includes two procedures for adding noise. There are multiple options for the input of the noise generators and the locations of noise injection; the potential transmission paths are shown as dashed lines. In the figure, $\odot$ denotes the Hadamard product, $\oplus$ denotes matrix or vector addition, and $\otimes$ denotes matrix multiplication. (c)(d)(e) Three architectures of noise generators for learning the distribution parameters.
  • Figure 4: Visualization of generated noise injected into raw images. The first row shows the raw images. The second row displays the noise-injected images. The third and fourth rows present the heatmaps of the mean $\mu$ and variance $\sigma$ for each pixel, respectively. In the first column, the noise deepens the color of an old barrel, making it look new again and thereby reducing the bias between datasets. In the last column, the vegetation and ball are perturbed by noise, which simplifies the recognition task.
  • Figure 5: Performance of few-shot learning across 11 datasets. In the top-left subplot, the results are averaged over 11 datasets. PiNI shows better performance compared to baselines, especially under conditions with fewer shots.
  • ...and 4 more figures