Transformer^-1: Input-Adaptive Computation for Resource-Constrained Deployment

Lumen AI, Tengzhou No. 1 Middle School, Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Xu Tianhao

TL;DR

This work targets the computation that fixed-depth Transformers waste on resource-constrained devices by introducing Transformer$^{-1}$, a dynamic-depth architecture that allocates computation per input. A complexity predictor and a reinforcement-learning policy jointly select the computation path. A theoretical bound on dynamic computation, together with layer folding and CUDA Graph pre-compilation, makes the approach practical to deploy. On ImageNet-1K the method reduces FLOPs by 42.7% and peak memory by 34.1% while keeping accuracy within ±0.3% of the standard Transformer, and it carries over to NLP tasks and to an edge device (Jetson AGX Xavier).

Abstract

Deep learning models that follow a fixed computation paradigm waste resources in dynamic scenarios. To address this, we propose Transformer$^{-1}$, an architecture based on the principle of deep adaptivity. It dynamically matches input features to computational resources through a joint optimization model for complexity and computation. Our core contributions are: (1) a two-layer control mechanism, composed of a complexity predictor and a reinforcement learning policy network, that enables end-to-end optimization of computation paths; (2) a theoretical lower bound for dynamic computation, showing that the system can in principle reach optimal efficiency; and (3) a layer folding technique and a CUDA Graph pre-compilation scheme that overcome the engineering bottlenecks of dynamic architectures. On the ImageNet-1K benchmark, our method reduces FLOPs by 42.7\% and peak memory usage by 34.1\% compared to the standard Transformer, while maintaining comparable accuracy (within $\pm$0.3\%). We also deployed the method on the Jetson AGX Xavier platform, verifying its effectiveness and practical value in resource-constrained environments. To further validate its generality, we ran experiments on several natural language processing tasks and obtained significant improvements in resource efficiency.
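
To make the idea of input-adaptive depth concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hand-written complexity score and a fixed linear rule in place of the learned complexity predictor and the reinforcement learning policy network; the names `predict_complexity`, `choose_depth`, and `run_adaptive` are hypothetical, and layer folding and CUDA Graph pre-compilation are not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def predict_complexity(x: Sequence[float]) -> float:
    """Hypothetical complexity score in [0, 1]: mean absolute deviation, clipped.

    Assumption: the paper uses a learned predictor; this is a hand-written stand-in.
    Empty inputs are not handled.
    """
    mean = sum(x) / len(x)
    spread = sum(abs(v - mean) for v in x) / len(x)
    return min(1.0, spread)


def choose_depth(score: float, max_depth: int, min_depth: int = 1) -> int:
    """Stand-in for the policy network: map the score linearly to a layer count."""
    extra = math.ceil(score * (max_depth - min_depth))
    return min(max_depth, min_depth + extra)


def run_adaptive(
    x: Vector, layers: Sequence[Layer], min_depth: int = 1
) -> Tuple[Vector, int]:
    """Run only the first `depth` layers, where depth depends on the input."""
    depth = choose_depth(predict_complexity(x), len(layers), min_depth)
    h = list(x)
    for layer in layers[:depth]:
        h = layer(h)
    return h, depth


if __name__ == "__main__":
    # Twelve toy "layers" that each add 1.0; real layers would be Transformer blocks.
    toy_layers = [lambda h: [v + 1.0 for v in h] for _ in range(12)]
    samples = [("easy", [0.5, 0.5, 0.5, 0.5]), ("hard", [-2.0, 2.0, -2.0, 2.0])]
    for name, sample in samples:
        _, used = run_adaptive(sample, toy_layers, min_depth=2)
        print(f"{name}: ran {used}/12 layers")
```

In this toy setup the constant "easy" input scores 0 and runs only the 2 mandatory layers, while the high-spread "hard" input scores 1 and runs all 12. The skipped layers are where an input-adaptive model would save FLOPs relative to a fixed-depth one.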

Paper Structure

This paper contains 30 sections, 19 equations, 7 tables, 1 algorithm.