DiffCLIP: Differential Attention Meets CLIP

Hasan Abed Al Kader Hammoud, Bernard Ghanem

TL;DR

DiffCLIP extends the differential attention mechanism to the CLIP vision-language framework to suppress attention noise and sharpen cross-modal alignment with minimal overhead. By learning two complementary attention maps and subtracting one from the other in both vision and text streams, it yields more focused representations and stronger performance on linear probing, few-shot, retrieval, and zero-shot tasks, including improved robustness to out-of-domain shifts. The approach achieves these gains with roughly 0.003% extra parameters and demonstrates that differential attention can be effectively ported to multimodal settings, with additional benefits when applied to just the vision encoder. Ablation studies reveal a flexible design, including dynamic versus static lambda initialization and vision-only configurations, while early scaling ideas suggest further gains from larger models and datasets. Overall, DiffCLIP offers a lightweight, robust enhancement to vision-language pretraining with practical implications for multimodal understanding and deployment.

Abstract

We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at https://github.com/hammoudhasan/DiffCLIP.
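As a rough illustration of the mechanism described above, the following is a minimal single-head sketch in pure Python. It assumes the formulation used for differential attention in language models: two softmax attention maps are computed from separate query/key pairs, and the second map, scaled by a scalar lambda, is subtracted from the first before the result is applied to the values. The function names (`softmax`, `attention_map`, `differential_attention`) and the fixed `lam` argument are hypothetical and chosen for illustration. They are not taken from the DiffCLIP repository, which should be consulted for the actual multi-head implementation with learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Scaled dot-product attention weights; one softmax row per query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K])
        for q in Q
    ]


def differential_attention(Q1, K1, Q2, K2, V, lam):
    """Subtract a second attention map (scaled by lam) from the first.

    Assumption: lam is a plain float here. In practice it would be a
    learned quantity; this sketch only shows the subtraction itself.
    Returns (output, differential_map).
    """
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    diff = [
        [a1 - lam * a2 for a1, a2 in zip(r1, r2)]
        for r1, r2 in zip(A1, A2)
    ]
    # Apply the differential map to the values: out = diff @ V.
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]
    return out, diff


# Toy example: 2 tokens, head dimension 2 (values are arbitrary).
Q1 = [[1.0, 0.0], [0.0, 1.0]]
K1 = [[1.0, 0.0], [0.0, 1.0]]
Q2 = [[0.5, 0.5], [0.5, 0.5]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out, diff = differential_attention(Q1, K1, Q2, K2, V, lam=0.5)
```

Because each softmax row sums to 1, each row of the differential map sums to `1 - lam`, and individual weights can become negative. That is how attention mass shared by both maps is cancelled rather than merely reduced.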

Paper Structure

This paper contains 28 sections, 18 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: CC3M Pretraining: CLIP vs. DiffCLIP Across Six Tasks. We compare standard CLIP (blue) and our DiffCLIP variant (pink) on linear probing, few-shot classification, image/text retrieval, zero-shot ImageNet, and zero-shot OOD. In each case, DiffCLIP consistently outperforms CLIP, highlighting the effectiveness of differential attention with only 0.003% extra parameters.
  • Figure 2: Comparing CLIP vs. DiffCLIP Attention Maps. For two images (rows), we visualize where CLIP and DiffCLIP attend when matching each image against two different textual queries. While CLIP allocates attention to irrelevant background regions, DiffCLIP more effectively centers on query-relevant objects, highlighting how differential attention can reduce noise and improve focus. Queries: First Row: "Mug", "Lamp"; Second Row: "Flower", "Dog".
  • Figure 3: OOD Zero-Shot ImageNet Performance. Comparison of zero-shot accuracy (%) on ImageNet, ImageNet-V2, ImageNet-A, ImageNet-R, and ImageNet-Sketch, plus the average. Bars show performance of CLIP (blue) versus DiffCLIP (pink), trained on CC3M (left) or CC12M (right). Numerical deltas above the bars indicate the absolute improvement or drop for DiffCLIP relative to CLIP. On average, DiffCLIP improves zero-shot performance on the OOD ImageNet datasets compared to CLIP.
  • Figure 4: MMVP-VLM Benchmarking. Radar plot illustrating performance on different fine-grained visual categories. Both models (CLIP in blue, DiffCLIP in pink) are evaluated on properties like orientation, positional context, and color appearance. DiffCLIP (average 27.6%) consistently outperforms CLIP (average 21.9%), demonstrating more focused attention on subtle visual details.
  • Figure 5: Comparing Different DiffCLIP Variants. We evaluate four models on six tasks (linear probing, few-shot, image retrieval, text retrieval, ImageNet zero-shot, and zero-shot OOD), all pretrained on CC12M. CLIP (blue) is the baseline, DiffCLIP (pink) uses a fixed differential attention parameter, DiffCLIP$^*$ (purple) employs a dynamic schedule for differential attention, and DiffCLIP$^\dagger$ (yellow) applies differential attention only to the vision encoder.