Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li

TL;DR

This paper tackles the quadratic bottleneck of self-attention in Vision Transformers by introducing Visual-Contrast Attention (VCA), a two-stage, linear-time mechanism that injects explicit discrimination into attention through visual-contrast tokens. Stage I globally summarizes the scene into a small set of contrast tokens with separate positive and negative streams, producing a contour of differences; Stage II performs patch-wise differential attention using this contrast map, preserving the global receptive field while achieving $\mathcal{O}(N n d)$ complexity. Empirically, VCA delivers consistent accuracy gains on ImageNet-1K across plain and hierarchical ViTs and improves generative quality (FID-50K) for both diffusion and flow-based models, all with a negligible number of added parameters and no additional FLOPs. The approach is architecture-agnostic and can replace MHSA in a wide range of ViTs, offering a practical path to faster, sharper vision models that maintain or improve performance.

Abstract

Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from $\mathcal{O}(N N C)$ to $\mathcal{O}(N n C)$ with $n \ll N$. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
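
The two-stage idea described above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function names, the average-pooling layout, the additive positive/negative embeddings `pos_emb` and `neg_emb`, and the fixed subtraction weight `lam` are all assumptions made for illustration; the abstract only states that pooled query tokens are split into positive and negative streams whose difference drives attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_queries(q, grid, pooled):
    # Average-pool a row-major (H*W, d) query field down to (h*w, d).
    H, W = grid
    h, w = pooled
    d = q.shape[-1]
    q = q.reshape(h, H // h, w, W // w, d)
    return q.mean(axis=(1, 3)).reshape(h * w, d)

def contrast_attention_sketch(x, wq, wk, wv, pos_emb, neg_emb,
                              grid, pooled, lam=0.5):
    # Hypothetical single-head layer; every name here is illustrative.
    d = wq.shape[1]
    scale = d ** -0.5
    q, k, v = x @ wq, x @ wk, x @ wv           # each (N, d)

    # Pooled queries become n contrast tokens with two embeddings (assumed additive).
    t = pool_queries(q, grid, pooled)          # (n, d)
    t_pos, t_neg = t + pos_emb, t + neg_emb

    # Stage I: contrast tokens read the whole image; cost O(n N d).
    a_pos = softmax(t_pos @ k.T * scale)       # (n, N)
    a_neg = softmax(t_neg @ k.T * scale)       # (n, N)
    v_c = (a_pos - lam * a_neg) @ v            # (n, d) contrast summary

    # Stage II: each patch attends to the n contrast tokens; cost O(N n d).
    b_pos = softmax(q @ t_pos.T * scale)       # (N, n)
    b_neg = softmax(q @ t_neg.T * scale)       # (N, n)
    return (b_pos - lam * b_neg) @ v_c         # (N, d)

# Usage: a 14x14 patch grid (N = 196) pooled to 7x7 contrast tokens (n = 49).
rng = np.random.default_rng(0)
grid, pooled, d = (14, 14), (7, 7), 64
N, n = grid[0] * grid[1], pooled[0] * pooled[1]
x = rng.standard_normal((N, d))
wq, wk, wv = (rng.standard_normal((d, d)) * d ** -0.5 for _ in range(3))
pos_emb = rng.standard_normal((n, d)) * 0.02
neg_emb = rng.standard_normal((n, d)) * 0.02
out = contrast_attention_sketch(x, wq, wk, wv, pos_emb, neg_emb, grid, pooled)
print(out.shape)  # (196, 64)
```

No N-by-N matrix is ever formed in this sketch: the largest attention maps are n-by-N and N-by-n, which is where the linear dependence on N comes from. If the two embeddings are identical and `lam` is 1, the two streams cancel and the output is zero, which shows that in this toy version the embeddings are what separate the positive stream from the negative one.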

Paper Structure

This paper contains 24 sections, 24 equations, 4 tables.