Rethinking Vision Transformer Depth via Structural Reparameterization

Chengwei Zhou, Vipin Chaudhary, Gourav Datta

TL;DR

This work tackles the depth–latency bottleneck of Vision Transformers by introducing a progressive structural reparameterization framework that trains parallel branches and gradually fuses them into ultra-shallow, single-path models for inference. By extracting layer normalization, reformulating multi-head attention in a blockwise, non-concatenation form, and progressively joining branches with a schedule, the method achieves exact reparameterization without approximation loss. Empirically, 6-, 4-, and 3-layer reparameterized ViTs can match or exceed the accuracy of 12-layer baselines on ImageNet-1K while delivering substantial deployment benefits, including up to ~37% CPU latency reduction and ~39% ARM latency improvement, with notable GPU throughput gains. These results challenge the necessity of deep ViTs and highlight practical opportunities for edge, mobile, and photonic/analog accelerators where parallelism and strict latency budgets dominate.

Abstract

The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
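The exactness claim rests on a simple identity: when several linear branches are summed before they enter a nonlinearity, the sum is itself one linear map, so the branches can be folded into a single weight matrix and bias with no approximation. The sketch below illustrates that identity for the first linear layer of an FFN. It is a minimal illustration and not the authors' implementation: the helper names (`merge_branches`, `multi_branch_forward`, `single_path_forward`), the three-branch setup, the toy dimensions, and the tanh-approximated GELU are all assumptions made for the example, and the attention reformulation mentioned in the TL;DR is not covered here.

```python
import math
import random

def linear(W, b, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def gelu(v):
    """Elementwise GELU (tanh approximation); stands in for any nonlinearity."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * t * (1.0 + math.tanh(c * (t + 0.044715 * t ** 3))) for t in v]

def multi_branch_forward(branches, x):
    """Training-time view: sum the branch pre-activations, then apply the nonlinearity."""
    pre = [linear(W, b, x) for W, b in branches]
    joined = [sum(vals) for vals in zip(*pre)]
    return gelu(joined)

def merge_branches(branches):
    """Fold parallel (W, b) branches into one (W, b) by elementwise summation."""
    W_sum = [[sum(vals) for vals in zip(*rows)]
             for rows in zip(*(W for W, _ in branches))]
    b_sum = [sum(vals) for vals in zip(*(b for _, b in branches))]
    return W_sum, b_sum

def single_path_forward(W, b, x):
    """Inference-time view: one linear map followed by the nonlinearity."""
    return gelu(linear(W, b, x))

# Toy setup (hypothetical sizes): 3 branches mapping 4 inputs to 5 hidden units.
rng = random.Random(0)
d_in, d_hidden, n_branches = 4, 5, 3
branches = [
    ([[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_hidden)],
     [rng.uniform(-1, 1) for _ in range(d_hidden)])
    for _ in range(n_branches)
]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

W_merged, b_merged = merge_branches(branches)
y_multi = multi_branch_forward(branches, x)
y_single = single_path_forward(W_merged, b_merged, x)

# The two paths agree up to floating-point rounding.
max_diff = max(abs(a - c) for a, c in zip(y_multi, y_single))
print(f"max abs difference: {max_diff:.2e}")
```

The fold is only valid once the branches share a single nonlinearity. If each branch applied its own activation before the sum, the weights could not be added, which is why the merge point sits at the entry of the nonlinear component.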

Paper Structure

This paper contains 24 sections, 8 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Proposed Depth Compression Framework for ViTs
  • Figure 2: Effect of different Lambda Warmup Periods on ImageNet-1K D-MAE-6-R fine-tuning accuracy with linear lambda scheduler. The red point indicates the best performing warmup (10k steps).
  • Figure 3: Visualization of different joining functions during the Lambda Warmup Period ($t \in [0,10k]$). After the warmup (e.g., 10k steps), the Post-Joining Adjustment Phase begins with $\lambda=1$.
  • Figure 4: Feature similarity heatmap across branches in the pre-trained D-MAE-3-R model before progressive joining. The visualization shows cosine similarity between feature embeddings from attention and FFN outputs of each branch.
  • Figure 5: Weight similarity matrices across branches before progressive joining. This analysis corroborates the feature diversity observed in Figure 4, confirming that branches maintain distinct parameter spaces in D-MAE-3-4.
  • ...and 1 more figure
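The captions for Figures 2 and 3 describe a linear lambda scheduler with a warmup period (best result at 10k steps), after which training continues with $\lambda=1$. A minimal sketch of such a schedule follows. The function name `linear_lambda`, the starting value of 0, and the clamping behavior are assumptions for illustration. The other joining functions shown in Figure 3 are not reproduced here because the captions do not define them.

```python
def linear_lambda(step, warmup_steps=10_000):
    """Linearly ramp lambda from 0 to 1 over the warmup, then hold it at 1.

    Assumes lambda starts at 0; the captions only state that lambda is 1
    once the warmup ends.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))

# Sample the schedule at a few training steps.
samples = {step: linear_lambda(step) for step in (0, 2_500, 5_000, 10_000, 20_000)}
print(samples)
```

Steps beyond the warmup correspond to the Post-Joining Adjustment Phase named in the Figure 3 caption, where lambda stays fixed at 1.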