Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models
Caner Erden
TL;DR
This work tackles the heavy computational cost of Multi-Head Self-Attention in large language models by introducing Dynamic Rank Reinforcement Learning (DR-RL), a framework that adaptively selects the low-rank representation of attention via reinforcement learning guided by online matrix perturbation theory. By modeling rank selection as an MDP and enforcing perturbation-based safety, DR-RL delivers context- and hardware-aware efficiency, allocating higher ranks to linguistically dense or complex regions while reducing computation for simpler segments. The approach is backed by incremental SVD updates, a Transformer-based policy, and a hybrid training regime, achieving near-full-rank accuracy with substantial FLOP reductions, especially for long sequences. Empirical results across standard benchmarks and ablations demonstrate the method's effectiveness and robustness, and the authors provide reusable code to foster reproducibility and further exploration in adaptive, self-optimizing attention mechanisms.
Abstract
We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) by combining reinforcement learning with online matrix perturbation theory. Traditional low-rank approximations often assume a static rank, which limits their flexibility across diverse input contexts. Our method instead selects ranks dynamically based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation is an RL agent that formulates rank selection as a sequential policy optimization problem, with a reward function that balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of a full decomposition during inference. Furthermore, a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensure scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (sequence length L > 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: https://github.com/canererden/DR_RL_Project
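
As a rough illustration of the building block that rank selection operates on, the sketch below truncates a single attention matrix by SVD and picks the smallest rank whose spectral-norm error stays under a tolerance. By the Eckart-Young theorem, the spectral-norm error of a rank-r truncation equals the first discarded singular value, which is the kind of quantity a perturbation-based safety check could monitor. The function names, the tolerance `tol`, and the toy sizes are assumptions made for illustration. This is not the authors' implementation, and it omits the RL policy, the incremental SVD updates, and the latency term of the reward.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def low_rank_approx(A, r):
    """Return the rank-r truncated SVD of A and all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return A_r, s


def smallest_safe_rank(s, tol):
    """Smallest r whose truncation error is at most tol * s[0].

    s holds singular values in descending order. The spectral-norm error
    of a rank-r truncation is s[r] (0-indexed), or 0 when r == len(s).
    The relative threshold is a hypothetical safety rule for this sketch.
    """
    for r in range(1, len(s) + 1):
        err = s[r] if r < len(s) else 0.0
        if err <= tol * s[0]:
            return r
    return len(s)


# Toy single-head attention matrix (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
seq_len, d_head = 64, 16
Q = rng.standard_normal((seq_len, d_head))
K = rng.standard_normal((seq_len, d_head))
A = softmax(Q @ K.T / np.sqrt(d_head))

_, s = low_rank_approx(A, 1)          # only the singular values are needed here
r = smallest_safe_rank(s, tol=0.05)   # hypothetical tolerance
A_r, _ = low_rank_approx(A, r)

measured = np.linalg.norm(A - A_r, 2)          # spectral norm of the residual
predicted = s[r] if r < len(s) else 0.0        # Eckart-Young prediction
print(f"rank={r}, measured error={measured:.3e}, predicted={predicted:.3e}")
```

A dynamic scheme would replace the fixed `tol` rule with a learned policy and would avoid recomputing the full SVD at every step. The sketch only shows why the singular values give a cheap, exact handle on the approximation error at a given rank.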
