AdaPerceiver: Transformers with Adaptive Width, Depth, and Tokens

Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, Yung-Hsiang Lu, James C. Davis

TL;DR

AdaPerceiver introduces a unified transformer that adapts along three axes—tokens, depth, and width—at inference time. It achieves this with a latent-stream design, block-masked attention, and Matryoshka FFNs, trained via a once-for-all, joint objective that optimizes multiple configurations in a single forward pass. Empirical results across ImageNet-1K classification, ADE20K segmentation, and NYUv2 depth estimation show improved accuracy–throughput Pareto fronts and substantial encoder FLOP reductions compared to strong baselines. The work demonstrates practical gains for deployment under diverse hardware and latency constraints, and highlights policy-driven adaptivity as a viable route to further efficiency gains.

Abstract

Modern transformer architectures achieve remarkable performance across tasks and domains but remain rigid in how they allocate computation at inference time. Real-world deployment often requires models to adapt to diverse hardware and latency constraints, yet most approaches to dynamic computation focus on a single axis, such as reducing the number of tokens. We present AdaPerceiver, the first transformer architecture with unified adaptivity across depth, width, and tokens within a single model. We couple this architecture with an efficient joint training regime that ensures the model maintains performance across its various configurations. We evaluate AdaPerceiver on image classification, semantic segmentation, and depth estimation tasks. On image classification, AdaPerceiver expands the accuracy-throughput Pareto front: it achieves 85.4% accuracy while yielding 36% higher throughput than FlexiViT-L. On dense prediction, AdaPerceiver matches ViT-H/14 with $\sim$26x fewer encoder FLOPs (floating-point operations) on semantic segmentation and depth estimation. Finally, we show that AdaPerceiver equipped with a policy can maintain ImageNet-1K accuracy (within $\pm 0.1$ percentage points) while reducing FLOPs by $24$-$33\%$.

Paper Structure

This paper contains 90 sections, 15 equations, 17 figures, 7 tables, 4 algorithms.

Figures (17)

  • Figure 1: Overview of Adaptive Perceiver (AdaPerceiver). (a) AdaPerceiver architecture. The architecture consists of three streams: input, output, and latent. Cross-attention blocks map input tokens to latent tokens and read out latent tokens to output tokens. The latent stream allows for adaptive embedding and token dimensions. (b) The AdaPerceiver block follows a standard pre-norm transformer architecture (Dosovitskiy et al., 2020) but replaces bidirectional self-attention with block mask attention (c). Its feed-forward network (FFN) is similar to MatFormer (Devvrit et al., 2024), enabling adaptive embedding dimensions. (c) Block mask attention is akin to self-attention in ViTs (Dosovitskiy et al., 2020) but applies Rotary Positional Encoding (RoPE) to the Q and K matrices (Su et al., 2024) and masks attention maps as shown in (d). This design enables adaptive token dimensions. (d) Visualization of block masking for $N$ tokens: red denotes masked tokens, while other colours indicate unmasked tokens. Masking restricts attention interactions at each latent token granularity, ensuring that later tokens can attend to earlier ones, but not vice versa. We elaborate in the method section. N.B. The $\log_2$-spaced token granularity is arbitrary.
  • Figure 2: ImageNet-1K Evaluation. Accuracy vs. throughput (samples/sec) comparison of AdaPerceiver against state-of-the-art adaptive architectures. Each point corresponds to a distinct configuration. AdaPerceiver's width ($w=832$) and depth ($l=21$) are fixed while varying the number of tokens. It achieves the best accuracy–efficiency trade-off: in the high-accuracy regime it matches large models, and in the high-throughput regime it matches FlexiViT-Base. Throughput is measured with batch size 512. This figure is a truncated version of the full comparison figure in the Appendix.
  • Figure 3: ImageNet-1K Configuration Tradeoffs. Accuracy vs. latency (ms) for AdaPerceiver under varying embedding dimensions and numbers of latent tokens. No configuration (point) requires retraining. Increasing the embedding dimension improves accuracy, while reducing the number of latent tokens decreases latency.
  • Figure 4: ADE20K Configuration Tradeoffs. Mean IoU vs. GFLOPs (encoder) for AdaPerceiver under varying embedding dimensions and latent tokens.
  • Figure 5: Depth Estimation Configuration Tradeoffs. RMSE vs. GFLOPs (encoder) for AdaPerceiver under varying embedding dimensions and latent tokens.
  • ...and 12 more figures
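
The block masking described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name `block_mask`, the example granularities `[1, 2, 4, 8]`, and the assumption that tokens within the same block attend to each other in both directions are all choices made here for illustration. The sketch only builds the boolean mask; RoPE and the attention computation itself are omitted.

```python
from bisect import bisect_right


def block_mask(granularities):
    """Build a boolean attention mask (hypothetical helper, not from the paper).

    `granularities` is an increasing list of token counts, e.g. [1, 2, 4, 8].
    Entry [q][k] is True when query token q may attend to key token k.
    """
    n = granularities[-1]
    # block_id[i] is the index of the smallest granularity that contains token i.
    block_id = [bisect_right(granularities, i) for i in range(n)]
    # A token may attend to tokens in its own block or in any earlier block
    # (assumption: attention inside a block is bidirectional).
    return [[block_id[q] >= block_id[k] for k in range(n)] for q in range(n)]


mask = block_mask([1, 2, 4, 8])

# Print the mask: '.' marks an allowed interaction, 'x' a masked one.
for row in mask:
    print("".join("." if allowed else "x" for allowed in row))
```

Under these assumptions, the first 4 tokens never attend to tokens 4 through 7, so the top-left 4x4 corner of the mask is identical to the mask built for granularities `[1, 2, 4]`. That nesting is what would let a model drop trailing latent tokens at inference time without changing what the remaining tokens see.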