ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models

Guoyizhe Wei, Rama Chellappa

TL;DR

ViT-Linearizer presents a cross-architecture distillation framework that transfers quadratic self-attention knowledge from Vision Transformers to linear-time recurrent vision models. By combining activation matching over intermediate token-activation maps with masked prediction for unseen tokens, the approach enables a linear-time student to inherit rich ViT representations while maintaining efficient inference. Empirically, the method delivers substantial speedups on high-resolution tasks (e.g., Cityscapes and ADE20K) with minimal accuracy loss, and sets new state-of-the-art results for Mamba-based architectures on ImageNet and segmentation benchmarks. This work demonstrates a practical pathway to bridge the gap between the superior representational power of ViTs and the efficiency of linear-time models, with broad applicability across teacher choices and student capacities.

Abstract

Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach combines two objectives to distill quadratic self-attention knowledge into the student while preserving its linear-time complexity: 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher's representations for unseen (masked) tokens. Empirically, our method provides notable speedups, particularly on high-resolution tasks, easing the hardware demands of inference. It also improves the performance of Mamba-based architectures on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
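
To make the quadratic-versus-linear gap concrete, the short sketch below counts patch tokens at a few input sizes and compares how token-mixing cost would grow under each regime. The patch size of 16 follows the ViT-B/16 teacher named in the figures; the square resolutions, the 224-pixel baseline, and the helper names are illustrative assumptions. Constant factors and the per-token MLP cost are ignored, so the numbers indicate growth rates only, not measured speedups.

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


def relative_mixing_cost(height, width, base=224, patch=16):
    """Token-mixing cost relative to a base x base input.

    Self-attention compares every token with every other token, so its cost
    grows with N^2; a linear-time recurrent mixer grows with N.
    """
    n = num_tokens(height, width, patch)
    n0 = num_tokens(base, base, patch)
    ratio = n / n0
    return {"tokens": n, "quadratic": ratio ** 2, "linear": ratio}


# Hypothetical square inputs, chosen only to illustrate the trend.
for side in (224, 512, 1024):
    print(side, relative_mixing_cost(side, side))
```

Under these assumptions, the token count grows with the image area, so the quadratic factor is the square of the linear one at every resolution.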

Paper Structure

This paper contains 26 sections, 6 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Accuracy-efficiency trade-offs. We distill CLIP's ViT-Base model into a linear-time Adventurer-Base (with Mamba-2 token mixers), which exhibits substantially superior accuracy-efficiency trade-offs across various datasets and tasks.
  • Figure 2: Overview of our cross-architecture distillation pipeline. We feed the complete input image to the frozen teacher (ViT) while providing a randomly masked image to the student (a recurrent model such as Adventurer). At $K$ intermediate stages, we enforce a token-wise matching between the teacher’s and student’s activation maps. In the final layer, the student predicts the teacher’s representations for the unseen (masked) tokens. Only the student network is trained, while the teacher remains frozen throughout.
  • Figure 3: Qualitative comparison of activation maps. The teacher model (CLIP ViT-B/16) consistently produces high-contrast activations with distinctly highlighted salient regions. The supervised Adventurer baseline (denoted "Sup.") exhibits noisy activations. Our distilled Adventurer model shows significant improvements, with feature patterns closely aligning to those of its ViT teacher.
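
The two objectives in the Figure 2 pipeline can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the activation map is assumed to be the token-by-token cosine-similarity matrix, activation matching is assumed to use a mean squared error, masked prediction is assumed to use a cosine distance, and student and teacher are assumed to expose the same number of tokens at each matched stage. All function names and the `weight` parameter are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def activation_map(tokens):
    """N x N cosine-similarity matrix between tokens (assumed map form)."""
    unit = [l2_normalize(t) for t in tokens]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def activation_matching_loss(student_tokens, teacher_tokens):
    """Mean squared difference between student and teacher activation maps."""
    s_map = activation_map(student_tokens)
    t_map = activation_map(teacher_tokens)
    n = len(s_map)
    total = sum((s_map[i][j] - t_map[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


def masked_prediction_loss(student_pred, teacher_feat, masked_idx):
    """Mean cosine distance, evaluated on masked token positions only."""
    total = 0.0
    for i in masked_idx:
        s = l2_normalize(student_pred[i])
        t = l2_normalize(teacher_feat[i])
        total += 1.0 - sum(a * b for a, b in zip(s, t))
    return total / len(masked_idx)


def distillation_loss(student_stages, teacher_stages,
                      student_pred, teacher_feat, masked_idx, weight=1.0):
    """Average matching loss over the K stages plus the masked-prediction term."""
    match = sum(activation_matching_loss(s, t)
                for s, t in zip(student_stages, teacher_stages))
    match /= len(student_stages)
    return match + weight * masked_prediction_loss(
        student_pred, teacher_feat, masked_idx)


# Toy example: 3 tokens, teacher width 2, student width 3, one matched stage.
teacher_stage = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_stage = [[0.5, 0.1, 0.0], [0.0, 0.9, 0.2], [0.7, 0.6, 0.1]]
teacher_final = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_final = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.6]]
print(distillation_loss([student_stage], [teacher_stage],
                        student_final, teacher_final, masked_idx=[1, 2]))
```

One consequence of comparing N x N similarity maps rather than raw features is that the intermediate student and teacher widths need not agree, as the toy example shows. In practice the teacher features would be computed without gradients, since only the student is trained.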