Transformer^-1: Input-Adaptive Computation for Resource-Constrained Deployment
Lumen AI, Tengzhou No. 1 Middle School, Shihao Ji, Zihui Song, Fucheng Zhong, Jisen Jia, Zhaobo Wu, Zheyi Cao, Xu Tianhao
TL;DR
Fixed-depth Transformers waste computation on resource-constrained devices because every input traverses the same number of layers. This work introduces Transformer$^{-1}$, a dynamic-depth architecture that allocates computation per input. A complexity predictor and a reinforcement-learning policy network jointly select the computation path, a theoretical lower bound characterizes the dynamic computation, and two engineering techniques (layer folding and CUDA Graph pre-compilation) make deployment practical. On ImageNet-1K the method reduces FLOPs by 42.7% and peak memory by 34.1% relative to a standard Transformer while keeping accuracy within ±0.3%. The approach is also deployed on the Jetson AGX Xavier platform and evaluated on several NLP tasks, where it likewise improves resource efficiency.
Abstract
To address the resource waste that fixed computation paradigms cause in deep learning models under dynamic scenarios, this paper proposes Transformer$^{-1}$, an architecture based on the principle of depth adaptivity. The architecture dynamically matches computational resources to input features by jointly optimizing over input complexity and computation. Our core contributions are: (1) a two-level control mechanism, composed of a complexity predictor and a reinforcement learning policy network, that enables end-to-end optimization of computation paths; (2) a theoretical lower bound on dynamic computation, proving that the system can theoretically reach optimal efficiency; and (3) a layer folding technique and a CUDA Graph pre-compilation scheme that overcome the engineering bottlenecks of dynamic architectures. On the ImageNet-1K benchmark, our method reduces FLOPs by 42.7\% and peak memory usage by 34.1\% compared to the standard Transformer, while keeping accuracy within $\pm$0.3\%. We also deploy the method on the Jetson AGX Xavier platform, verifying its effectiveness and practical value in resource-constrained environments. To further validate its generality, we run experiments on several natural language processing tasks and observe significant improvements in resource efficiency.
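The two-level control idea in contribution (1) can be illustrated with a minimal sketch. This is not the authors' implementation. The variance-based `predict_complexity`, the linear `choose_depth` rule (a stand-in for the learned reinforcement learning policy), and the toy residual layers are all hypothetical choices made only to show how a per-input complexity score could translate into a variable number of executed layers.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def predict_complexity(x: Sequence[float]) -> float:
    """Hypothetical complexity predictor: maps input variance into [0, 1)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return 1.0 - math.exp(-var)


def choose_depth(score: float, max_depth: int, min_depth: int = 1) -> int:
    """Stand-in for the policy network: linear map from score to layer count."""
    depth = min_depth + round(score * (max_depth - min_depth))
    return max(min_depth, min(max_depth, depth))


def make_layer(scale: float) -> Layer:
    """Toy residual layer (assumption): h -> h + tanh(scale * h)."""
    def layer(h: Vector) -> Vector:
        return [v + math.tanh(scale * v) for v in h]
    return layer


def adaptive_forward(x: Sequence[float], layers: List[Layer]) -> Tuple[Vector, int]:
    """Run only as many layers as the predicted complexity calls for."""
    depth = choose_depth(predict_complexity(x), len(layers))
    h = list(x)
    for layer in layers[:depth]:
        h = layer(h)
    return h, depth


layers = [make_layer(0.5) for _ in range(12)]
easy = [1.0, 1.0, 1.0, 1.0]    # zero variance, so the lowest complexity score
hard = [-3.0, 0.0, 3.0, 6.0]   # high variance, so a score close to 1

_, easy_depth = adaptive_forward(easy, layers)
_, hard_depth = adaptive_forward(hard, layers)
print(easy_depth, hard_depth)
```

In this toy setting the uniform input exits after a single layer while the high-variance input uses the full stack, so the average compute over a batch depends on how many inputs are easy. The paper's actual predictor and policy are learned end to end, which this sketch does not attempt to reproduce.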
