Enhance Vision-Language Alignment with Noise
Sida Huang, Hongyuan Zhang, Xuelong Li
TL;DR
PiNI introduces a noise-based fine-tuning framework for CLIP that learns a beneficial noise distribution ($π$-noise) and injects it into both the visual and text encoders to improve vision–language alignment under few-shot constraints. By reformulating CLIP inference to treat prompts as a stochastic variable and applying variational inference, PiNI derives a tractable objective that guides noise generation conditioned on prompts. Empirical results across 11 datasets show that PiNI outperforms zero-shot CLIP and several parameter-efficient fine-tuning (PEFT) baselines, with pronounced gains in very low-shot regimes and robust domain generalization. The work highlights a new direction in VL fine-tuning that leverages learned noise to diversify embeddings and reduce dataset bias, with potential extensions to VQA, detection, and generation tasks.
Abstract
With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Unlike existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise or $π$-noise), which quantitatively analyzes the impact of noise. This study suggests a new scheme: learn a beneficial noise distribution that can then be used to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate $π$-noise for the visual and linguistic modalities. We then propose the Positive-incentive Noise Injector (PiNI), which fine-tunes CLIP by injecting noise into both the visual and text encoders. Because the proposed method learns the distribution of beneficial noise, it yields more diverse vision and language embeddings, which better align the two modalities for specific downstream tasks within limited computational resources. We evaluate different noise incorporation approaches and network architectures for PiNI. The evaluation across 11 datasets demonstrates its effectiveness.
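To make the idea of injecting sampled noise into both modalities concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian noise drawn with the reparameterization trick, additive injection at the embedding level, and Monte Carlo averaging of class probabilities over several draws. All function names, the fixed `mu`/`log_sigma` parameters (which a learned noise generator would produce in practice), and the logit scale of 100 are illustrative assumptions.

```python
import math
import random

def sample_noise(mu, log_sigma, rng):
    """Reparameterized Gaussian sample: eps = mu + sigma * z, with z ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]

def inject(embedding, noise):
    """Additive injection (assumed; only one possible incorporation scheme)."""
    return [e + n for e, n in zip(embedding, noise)]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def clip_logits(image_emb, text_embs, scale=100.0):
    """Scaled cosine similarity between one image and each class prompt."""
    img = l2_normalize(image_emb)
    return [scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
            for t in text_embs]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_predict(image_emb, text_embs, img_params, txt_params, rng, n_samples=8):
    """Average class probabilities over several noise draws (assumed scheme)."""
    avg = [0.0] * len(text_embs)
    for _ in range(n_samples):
        img = inject(image_emb, sample_noise(*img_params, rng))
        txts = [inject(t, sample_noise(*txt_params, rng)) for t in text_embs]
        probs = softmax(clip_logits(img, txts))
        avg = [a + p / n_samples for a, p in zip(avg, probs)]
    return avg

# Toy usage: 3-dimensional stand-ins for frozen CLIP image and text embeddings.
rng = random.Random(0)
image = [1.0, 0.0, 0.0]
texts = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.0]]
params = ([0.0, 0.0, 0.0], [math.log(0.01)] * 3)  # (mu, log_sigma), hypothetical
print(noisy_predict(image, texts, params, params, rng))
```

In this sketch the encoders stay frozen and only the noise parameters would be trained. Setting `log_sigma` to negative infinity with zero `mu` recovers the noise-free CLIP prediction, which gives a simple sanity check.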
