
Diffusion Feedback Helps CLIP See Better

Wenxuan Wang, Quan Sun, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang

TL;DR

This paper tackles CLIP's weakness in fine-grained visual perception by introducing DIVA, a post-training method that uses a pre-trained text-to-image diffusion model as a Visual Assistant to refine CLIP representations using only images. By conditioning the diffusion model on CLIP's dense visual features and applying a reconstruction loss, DIVA updates CLIP encodings to recover subtle visual details while keeping the diffusion model fixed. Across 29 benchmarks, DIVA improves MMVP-VLM scores and enhances multimodal large language model backbones and segmentation tasks, all while preserving CLIP's strong zero-shot generalization. The approach is lightweight, self-supervised, and leverages a visual dense recap to balance conditioning richness and reconstruction difficulty, suggesting broad potential for diffusion-guided enhancements in vision-language systems.

Abstract

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings: it can hardly distinguish orientation, quantity, color, structure, and similar attributes. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to a lack of distinctiveness in the text and of diversity in the images. In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images (without corresponding text). We demonstrate that DIVA improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities, by a large margin (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities. The code is available at https://github.com/baaivision/DIVA.
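
The image-only objective described in the abstract can be sketched as a noise-prediction loss. The form below is an assumption based on the standard denoising-diffusion formulation, not a quotation of the paper's equations: $\theta$ denotes the trainable CLIP parameters, $\phi$ the frozen diffusion model, $\mathbf{c}_\theta(x_0)$ the condition built from CLIP's visual features, and $\bar{\alpha}_t$ a generic noise schedule.

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\theta^{*} = \arg\min_{\theta}\; \mathbb{E}_{x_0,\, t,\, \epsilon} \left[ \left\lVert \epsilon - \epsilon_{\phi}\big(x_t,\, t,\, \mathbf{c}_{\theta}(x_0)\big) \right\rVert_2^2 \right]
$$

Only $\theta$ is optimized. Since no caption appears anywhere in the loss, training needs images alone.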

Paper Structure (17 sections, 5 equations, 3 figures, 9 tables, 1 algorithm)

Figures (3)

  • Figure 1: Left: Existing CLIP models mostly fail to distinguish visual details. After their visual capabilities are enhanced with DIVA, the sensitivity of CLIP to visual details improves greatly. Right: DIVA consistently boosts the performance of various CLIP models [radford2021learning, fang2023data, xu2023demystifying, zhai2023sigmoid] on the MMVP-VLM benchmark, which evaluates the visual capabilities of vision-language models.
  • Figure 2: Overall architecture of DIVA. Given an image $x_0$, the CLIP model $\theta$ encodes the visual features that form the main part of the condition $\mathbf{c}$; the generative diffusion model $\phi$ then predicts the added noise $\epsilon$, taking the noisy image $x_t$ and the condition $\mathbf{c}$ as input. We optimize CLIP's representation by maximizing the image likelihood with the diffusion loss via generative feedback.
  • Figure 3: Qualitative analysis on the MMVP-VLM and MMVP benchmarks. Left: Predictions from the OpenAI ViT-L-14 CLIP before and after incorporating DIVA. Right: Predictions from LLaVA-1.5-7B before and after using DIVA. The results on both benchmarks show that the framework can greatly enhance CLIP models' fine-grained visual perception capability and effectively alleviate the hallucination problem.
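
The feedback loop in Figure 2 can be sketched with a toy NumPy example. Everything in it is an assumption made for illustration: a linear map `W` stands in for the CLIP encoder $\theta$, a fixed linear decoder `P` inside `denoiser` stands in for the frozen diffusion model $\phi$, and the schedule `alpha_bar`, the sizes, and the learning rate are arbitrary. The real method uses a CLIP vision encoder and a pre-trained text-to-image diffusion model. The sketch only shows the mechanics: noise an image, let the frozen denoiser predict that noise from the noisy image and the encoder's condition, and send the gradient of the noise-prediction error to the encoder alone.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, T = 8, 16, 100  # toy image dim, condition dim, number of diffusion steps

# Toy cumulative noise schedule (assumption): x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps
alpha_bar = np.linspace(0.9, 0.1, T)

# Frozen stand-in for the diffusion model phi; it is never updated.
P = rng.normal(size=(K, D)) / np.sqrt(K)
P_FROZEN_COPY = P.copy()


def denoiser(x_t, t, c):
    """Frozen noise predictor: decodes the condition into an estimate of x0,
    then solves the forward-process equation for the noise."""
    ab = alpha_bar[t][:, None]
    x0_hat = c @ P
    return (x_t - np.sqrt(ab) * x0_hat) / np.sqrt(1.0 - ab)


def diffusion_loss_and_grad(W, x0, t, eps):
    """Noise-prediction loss and its gradient with respect to the encoder W only."""
    ab = alpha_bar[t][:, None]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps  # forward (noising) process
    c = x0 @ W                                        # "CLIP" features used as condition
    residual = denoiser(x_t, t, c) - eps
    loss = np.mean(np.sum(residual ** 2, axis=1))
    # Manual backprop: through the frozen denoiser, into the condition, into W.
    d_eps_hat = 2.0 * residual / x0.shape[0]
    d_c = (d_eps_hat * (-np.sqrt(ab) / np.sqrt(1.0 - ab))) @ P.T
    grad_W = x0.T @ d_c
    return loss, grad_W


def train(steps=500, batch=64, lr=0.01):
    """Image-only training loop: no text appears anywhere."""
    W = np.zeros((D, K))
    # Fixed held-out batch to compare the loss before and after training.
    x_eval = rng.normal(size=(256, D))
    t_eval = rng.integers(0, T, size=256)
    eps_eval = rng.normal(size=(256, D))
    loss_before, _ = diffusion_loss_and_grad(W, x_eval, t_eval, eps_eval)
    for _ in range(steps):
        x0 = rng.normal(size=(batch, D))      # toy "images"
        t = rng.integers(0, T, size=batch)    # random timestep per sample
        eps = rng.normal(size=(batch, D))     # noise to be predicted
        _, grad_W = diffusion_loss_and_grad(W, x0, t, eps)
        W -= lr * grad_W                      # update the encoder only
    loss_after, _ = diffusion_loss_and_grad(W, x_eval, t_eval, eps_eval)
    return W, loss_before, loss_after


if __name__ == "__main__":
    _, before, after = train()
    print(f"held-out diffusion loss: {before:.3f} -> {after:.3f}")
```

Because `P` is never updated, the only way to lower the loss is for the condition `c` to carry more of the image's detail. That is the intuition behind using the diffusion model as a frozen assistant.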