ViT-AdaLA: Adapting Vision Transformers with Linear Attention

Yifan Li, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Jason Kuen, Yu Kong, Trung Bui

Abstract

Vision Transformer (ViT)-based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterparts.

Paper Structure (26 sections, 8 equations, 14 figures, 12 tables)

This paper contains 26 sections, 8 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Comparison between training-from-scratch and linearization paradigms for ViTs with linear attention. Training-from-scratch linear attention paradigms focus on designing accurate attention approximation methods, which typically require large-scale pretraining to acquire strong prior knowledge. In contrast, ViT linearization leverages an off-the-shelf pretrained ViT, substantially reducing the need for extensive pretraining.
  • Figure 2: Comparison of decoder and encoder–decoder architectures. In decoder-based LLMs, the LLM serves as both the feature extractor and the target generator. In contrast, in vision models, ViTs function solely as feature extractors, while a separate task-specific head is responsible for target generation.
  • Figure 3: Efficiency comparison of different attention mechanisms, showing how peak memory and GFLOPS vary with sequence length. Only the attention module is benchmarked in these experiments. "Vanilla" indicates the vanilla linear attention (Katharopoulos et al., 2020).
  • Figure 4: ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. First, softmax attention is approximated by tuning only the linear attention modules. Second, to mitigate residual approximation errors that accumulate across layers, the feature alignment stage fine-tunes the entire linearized model by aligning its final-layer representations with those of the original softmax-based teacher. Finally, supervised fine-tuning is performed to transfer the adapted prior knowledge to downstream tasks.
  • Figure 5: Linear attention architecture and Stage 1 training loss comparison with LoLCATS ($SM$: softmax; $\oplus$: concatenation). LoLCATS approximates the attention output based on Hedgehog (Zhang et al.) by tuning only two additional mapping modules applied to the queries and keys individually. In contrast, we tune all query, key, and value weights to approximate the attention output, which is both more efficient and more effective than the original LoLCATS approach.
  • ...and 9 more figures
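
The captions for Figures 3 to 5 contrast softmax attention with vanilla linear attention and describe Stage 1 as aligning the two attention outputs. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes a single attention head, the elu(x) + 1 feature map commonly used for vanilla linear attention, and a mean-squared-error alignment objective; the function names and the choice of MSE are illustrative assumptions that the text above does not specify.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard softmax attention; builds an N x N matrix, so cost is quadratic in N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def elu_plus_one(x):
    """Positive feature map phi(x) = elu(x) + 1 (assumed choice for 'vanilla')."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Kernelized attention computed as phi(Q) (phi(K)^T V); never forms the N x N matrix."""
    phi_q, phi_k = elu_plus_one(q), elu_plus_one(k)
    kv = phi_k.T @ v                  # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)             # (d,) normalizer term
    return (phi_q @ kv) / (phi_q @ z)[:, None]


def alignment_loss(student_out, teacher_out):
    """Hypothetical Stage 1 objective: MSE between linear and softmax attention outputs."""
    return float(np.mean((student_out - teacher_out) ** 2))


# Usage: 196 tokens (a 14 x 14 patch grid) with head dimension 64.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((196, 64)) for _ in range(3))
teacher = softmax_attention(q, k, v)   # frozen softmax output
student = linear_attention(q, k, v)    # linear attention output to be aligned
print("alignment loss:", alignment_loss(student, teacher))
```

In an actual training loop, the query, key, and value projections that produce the student's inputs would be the trainable parameters, and the loss would be minimized per block; the sketch only shows the quantities being compared.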