big.LITTLE Vision Transformer for Efficient Visual Recognition

He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong

TL;DR

The paper tackles the slow inference of Vision Transformers by introducing the big.LITTLE Vision Transformer (bLViT), a dual-block architecture that uses token importance scores to route tokens dynamically between a high-capacity Performance block (P-block) and a fast Efficiency block (E-block). The E-block updates all tokens, while only the highest-scoring tokens are also processed by the P-block, so contextual information is preserved while computation is reduced. The resulting per-layer cost is $6.5NC^2 + N^2C$, a theoretical speedup of more than $1.84\times$. Training employs feature distillation from a vanilla ViT to mitigate pruning-induced performance loss, with a total loss $L_{total} = L_{supervised} + \lambda_{fd} L_{fd}$, where $L_{fd}$ is defined via cosine similarity between student and teacher features. Empirically, bLViT delivers strong accuracy on ImageNet-1K and competitive segmentation performance on SAM-related tasks, reduces GFLOPs by roughly half for the lightweight B+T configuration, and maintains robust results across larger P/E-block pairings. The work demonstrates a practical and scalable path to efficient, high-performance ViT-based vision systems suitable for real-world deployment.
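The two formulas in this summary can be checked with a few lines of arithmetic. The sketch below rests on assumptions that are not stated here: $N$ is read as the number of tokens and $C$ as the channel dimension, the vanilla ViT block is costed with the commonly used estimate $12NC^2 + 2N^2C$, and $L_{fd}$ is taken to be one minus the cosine similarity averaged over tokens, which is only one plausible form. The function names and the `lambda_fd` argument are hypothetical.

```python
import math


def vit_block_flops(n, c):
    # Assumption: common estimate for a vanilla ViT block, 12*N*C^2 + 2*N^2*C.
    return 12 * n * c**2 + 2 * n**2 * c


def blvit_block_flops(n, c):
    # Per-layer cost quoted in the summary: 6.5*N*C^2 + N^2*C.
    return 6.5 * n * c**2 + n**2 * c


def theoretical_speedup(n, c):
    # Ratio of the two estimates; it lies between 12/6.5 and 2 for any n, c > 0.
    return vit_block_flops(n, c) / blvit_block_flops(n, c)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_distillation_loss(student, teacher):
    # Assumed form: mean over tokens of (1 - cosine similarity).
    terms = [1.0 - cosine_similarity(s, t) for s, t in zip(student, teacher)]
    return sum(terms) / len(terms)


def total_loss(l_supervised, l_fd, lambda_fd):
    # L_total = L_supervised + lambda_fd * L_fd, as written in the summary.
    return l_supervised + lambda_fd * l_fd
```

Under these assumptions the ratio equals $(12C + 2N)/(6.5C + N)$, which is bounded below by $12/6.5 \approx 1.846$, in line with the stated speedup of more than $1.84\times$.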

Abstract

In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, characterized by its high capacity and substantial computational demands, and the LITTLE efficiency block, designed for speed with lower capacity. The key innovation of our approach lies in its dynamic inference mechanism. When processing an image, our system determines the importance of each token and allocates the tokens accordingly: essential tokens are processed by the high-performance big model, while less critical tokens are handled by the more efficient little model. This selective processing significantly reduces computational load without sacrificing the overall performance of the model, as it ensures that detailed analysis is reserved for the most important information. To validate the effectiveness of our big.LITTLE Vision Transformer, we conducted comprehensive experiments on image classification and the segment anything task. Our results demonstrate that the big.LITTLE architecture not only maintains high accuracy but also achieves substantial computational savings. Specifically, our approach enables the efficient handling of large-scale visual recognition tasks by dynamically balancing the trade-offs between performance and efficiency. The success of our method underscores the potential of hybrid models in optimizing both computation and performance in visual recognition tasks, paving the way for more practical and scalable deployment of advanced neural networks in real-world applications.

Paper Structure

This paper contains 23 sections, 4 equations, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Comparison between big.LITTLE and conventional token pruning, and performance of various token pruning strategies. The left diagram compares the standard ViT, token pruning (which selectively removes less important tokens), and the big.LITTLE ViT, which integrates both high-capacity performance blocks (P-Block) and high-efficiency blocks (E-Block) for dynamic token processing. The right plot shows the performance and efficiency of different models and our big.LITTLE ViT on the ImageNet classification task. Here, the marker shape indicates the baseline that each model corresponds to. This visual comparison underscores the ability of big.LITTLE ViT to maintain high accuracy while significantly enhancing processing speed.
  • Figure 2: The pipeline of the big.LITTLE Vision Transformer module. Left: The module takes the image token sequence as input. The efficiency block (E-Block) updates all tokens at high speed. The importance scores from a prediction layer are then used to select tokens, where a higher score indicates greater importance for the final prediction. The selected tokens are fed into the high-capacity performance block (P-Block). Finally, the outputs from the E-Block and P-Block are fused to form new image representations. Right: The P-Block uses semi cross attention to facilitate information interaction between the selected tokens and all tokens, while the E-Block is a vanilla ViT block with dimension matching.
  • Figure 3: Token Selection Visualization. In bLViT, the tokens processed in the high-capacity P-block highlight areas crucial for image classification.
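
The pipeline described in the Figure 2 caption can be sketched in NumPy as follows. This is an illustrative toy, not the authors' implementation, and several choices are assumptions: the E-Block is replaced by a single residual projection rather than a full ViT block, importance scores are computed from the E-Block output by a linear layer, "semi cross attention" is read as queries from the selected tokens with keys and values from all tokens, and fusion is a simple addition into the selected positions. The names `big_little_layer`, `init_params`, and `keep_ratio` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(c, seed=0):
    # Random weights for the toy layer; shapes are assumptions for illustration.
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(c)
    return {
        "w_e": rng.normal(size=(c, c)) * scale,
        "w_score": rng.normal(size=(c,)) * scale,
        "w_q": rng.normal(size=(c, c)) * scale,
        "w_k": rng.normal(size=(c, c)) * scale,
        "w_v": rng.normal(size=(c, c)) * scale,
    }


def e_block(tokens, w_e):
    # Stand-in for the Efficiency block: a cheap residual update of ALL tokens.
    return tokens + np.tanh(tokens @ w_e)


def semi_cross_attention(selected, all_tokens, w_q, w_k, w_v):
    # Assumed reading: queries from selected tokens, keys/values from all tokens.
    q = selected @ w_q
    k = all_tokens @ w_k
    v = all_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v


def big_little_layer(tokens, params, keep_ratio=0.5):
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    updated = e_block(tokens, params["w_e"])       # E-Block: every token
    scores = updated @ params["w_score"]           # importance score per token
    top_idx = np.argsort(-scores)[:k]              # indices of the k highest scores
    p_out = semi_cross_attention(
        updated[top_idx], updated, params["w_q"], params["w_k"], params["w_v"]
    )                                              # P-Block: selected tokens only
    fused = updated.copy()
    fused[top_idx] += p_out                        # assumed additive fusion
    return fused, top_idx
```

Calling `big_little_layer(tokens, init_params(c))` on an array of shape `(n, c)` returns the fused tokens and the indices routed to the P-Block. Tokens that were not selected keep exactly their E-Block output, which is the property that distinguishes this design from pruning, where such tokens would be dropped.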