Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin

TL;DR

This paper tackles the high computational cost of softmax attention in Transformers by introducing Rodimus, a linear-attention, purely recurrent model, and Rodimus$+$, a hybrid that combines Rodimus with Sliding Window Shared-Key Attention (SW-SKA). Key to the approach is the data-dependent tempered selection (DDTS) mechanism, which compresses history into a fixed-size hidden state while preserving essential information, enabling $O(1)$ per-token inference and sub-quadratic training. Rodimus$+$ further fuses semantic, token, and head compression, employing Shared-Key Attention to achieve lossless KV compression and SW-SKA for local context with a two-hop residual that tightly couples token and channel mixing. Across WikiText-103 and Pile-scale language modeling, as well as MQAR and NeedleBench recall benchmarks, Rodimus and Rodimus$+$ consistently outperform or match state-of-the-art recurrent and sparse-attention models with substantially reduced memory footprints, and Rodimus$+$ demonstrates the strongest gains at larger scales. The work shows that, with carefully designed gates and hybrid attention, recurrence can approach or exceed the performance of full softmax attention at a fraction of the computational and memory cost, signaling a practical path toward efficient, scalable LLMs. The authors also release open-source code and pre-trained checkpoints, encouraging broader adoption and further refinement of efficient recurrent architectures for NLP and code-understanding tasks.

Abstract

Recent advancements in Transformer-based large language models (LLMs) have set new standards in natural language processing. However, the classical softmax attention incurs significant computational costs, leading to an $O(T)$ complexity for per-token generation, where $T$ represents the context length. This work explores reducing LLMs' complexity while maintaining performance by introducing Rodimus and its enhanced version, Rodimus$+$. Rodimus employs an innovative data-dependent tempered selection (DDTS) mechanism within a linear attention-based, purely recurrent framework, achieving significant accuracy while drastically reducing the memory usage typically associated with recurrent models. This method exemplifies semantic compression by maintaining essential input information with fixed-size hidden states. Building on this, Rodimus$+$ combines Rodimus with the innovative Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach, effectively leveraging the complementary semantic, token, and head compression techniques. Our experiments demonstrate that Rodimus$+$-1.6B, trained on 1 trillion tokens, achieves superior downstream performance against models trained on more tokens, including Qwen2-1.5B and RWKV6-1.6B, underscoring its potential to redefine the accuracy-efficiency balance in LLMs. Model code and pre-trained checkpoints are open-sourced at https://github.com/codefuse-ai/rodimus.
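The contrast between $O(T)$ per-token generation and a fixed-size hidden state can be illustrated with a small sketch. It compares one decoding step of softmax attention over a growing key/value cache with one step of a gated linear-attention recurrence. The gate used here (a sigmoid of the key) is a hypothetical stand-in and not the paper's DDTS; all function names are illustrative assumptions.

```python
import math
import random


def softmax_decode_step(q, k_cache, v_cache):
    """One softmax-attention decoding step; work grows with len(k_cache)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    d_v = len(v_cache[0])
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) / z for j in range(d_v)]


def linear_decode_step(state, q, k, v, decay):
    """One gated linear-attention step with a fixed-size state.

    state is a d_k x d_v matrix updated in place as
    S[i][j] = decay[i] * S[i][j] + k[i] * v[j]; the output is q^T S.
    """
    d_k, d_v = len(k), len(v)
    for i in range(d_k):
        for j in range(d_v):
            state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
    return [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]


def memory_after(T, d_k=4, d_v=4, seed=0):
    """Decode T tokens both ways; return (cache floats, state floats)."""
    rng = random.Random(seed)
    k_cache, v_cache = [], []
    state = [[0.0] * d_v for _ in range(d_k)]
    for _ in range(T):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        v = [rng.gauss(0, 1) for _ in range(d_v)]
        k_cache.append(k)
        v_cache.append(v)
        softmax_decode_step(q, k_cache, v_cache)
        # Hypothetical gate: sigmoid of the key, only for illustration.
        decay = [1.0 / (1.0 + math.exp(-x)) for x in k]
        linear_decode_step(state, q, k, v, decay)
    cache_floats = len(k_cache) * d_k + len(v_cache) * d_v
    state_floats = d_k * d_v
    return cache_floats, state_floats
```

Calling `memory_after(8)` and `memory_after(64)` shows the cache growing linearly with the number of decoded tokens while the recurrent state stays at `d_k * d_v` entries, which is the property the abstract refers to as fixed-size hidden states.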

Paper Structure

This paper contains 44 sections, 2 theorems, 22 equations, 15 figures, 22 tables.

Key Result

Proposition 1

Given the specifications for ${\bm{A}}_t$ and ${\bm{B}}_t$ in Eqs. \eqref{eq:ssm_equivalent}, \eqref{eq:general_state_transition}, \eqref{eq:alpha_def}, and \eqref{eq:beta_def}, DDTS can realize the selection between the previous state ${\bm{S}}_{t-1}$ and the current input ${\bm{u}}_t$.
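To make the notion of selection concrete, the following is a minimal sketch under an assumed generic element-wise gated update. The specific forms of ${\bm{A}}_t$ and ${\bm{B}}_t$ are defined by the equations the proposition cites and are not reproduced here; the update form, the $[0,1]$ range, and the assumption that ${\bm{u}}_t$ has the same shape as ${\bm{S}}_t$ are illustrative only.

$$
{\bm{S}}_t = {\bm{A}}_t \odot {\bm{S}}_{t-1} + {\bm{B}}_t \odot {\bm{u}}_t, \qquad {\bm{A}}_t, {\bm{B}}_t \in [0,1] \ \text{element-wise (assumed)}
$$

$$
{\bm{A}}_t \to 1,\ {\bm{B}}_t \to 0 \ \Rightarrow\ {\bm{S}}_t \approx {\bm{S}}_{t-1}, \qquad {\bm{A}}_t \to 0,\ {\bm{B}}_t \to 1 \ \Rightarrow\ {\bm{S}}_t \approx {\bm{u}}_t
$$

Under this reading, "selection" means the gates can move between retaining the previous state and overwriting it with the current input, with intermediate gate values blending the two.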

Figures (15)

  • Figure 1: Memory Footprint vs. Performance: (a) This experiment is conducted on the WikiText-103 dataset (details in Appendix \ref{app:analysis_mem_size}) and focuses on the best recorded perplexity (PPL). The memory footprint is adjusted by modifying the expansion factor. (b) The model's recall capability is assessed using the MQAR task (see Appendix \ref{app:mqar}), as described in \cite{arora2024zoology}. Among all models evaluated, Rodimus* achieves the optimal balance between space complexity and performance.
  • Figure 1: Results on WikiText-103.
  • Figure 2: Overview of the Proposed Models. The Rodimus* Model serves as a template for both Rodimus and Rodimus$+$. When modules within the gray dashed box are included, it becomes the Rodimus$+$ Model; otherwise, it is the Rodimus Model. The architecture comprises $L$ layers of stacked Rodimus* Blocks along with essential modules for language modeling (e.g., Embedding, RMSNorm, LM Head). Rodimus Flow (Eq. \eqref{eq:rodimus_recurrent_form}) denotes our proposed recurrent computation method. The purple arrows depict information flow between layers (ignoring RMSNorm). In the Rodimus$+$ Model, the Rodimus block compresses the Token Embedding into a global semantic embedding, enhancing SW-SKA's ability to perceive long contexts.
  • Figure 3: Comparison of Different Attention Mechanisms. In MQA and GQA, values are shared among the same group of query heads, resulting in lossy compression compared to the individual head-specific values used in MHA. SKA maintains the multi-value setting of MHA but uses a shared key across all heads. This approach produces a separate attention map for each head, preserving the expressiveness of MHA while reducing the memory footprint.
  • Figure 4: Scaling Curve based on Scaling Laws.
  • ...and 10 more figures
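The Shared-Key Attention description in the Figure 3 caption (one key shared by all heads, with per-head queries and values) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, tensor layouts, causal masking, and the optional `window` argument (a hypothetical way to mimic the sliding-window variant) are all assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def shared_key_attention(queries, keys, values, window=None):
    """Causal attention with one key sequence shared by every head.

    queries: [H][T][d] per-head queries
    keys:    [T][d]    a single key per token, shared across heads
    values:  [H][T][d_v] per-head values
    window:  if set, each token attends only to the last `window` tokens
    Returns (outputs [H][T][d_v], attention maps [H][T][T]).
    """
    num_heads, seq_len, d = len(queries), len(keys), len(keys[0])
    scale = 1.0 / math.sqrt(d)
    outputs, maps = [], []
    for h in range(num_heads):
        d_v = len(values[h][0])
        head_out, head_map = [], []
        for t in range(seq_len):
            start = 0 if window is None else max(0, t + 1 - window)
            positions = range(start, t + 1)
            scores = [
                scale * sum(a * b for a, b in zip(queries[h][t], keys[s]))
                for s in positions
            ]
            weights = softmax(scores)
            # Each head gets its own map because its queries differ.
            head_map.append([0.0] * start + weights + [0.0] * (seq_len - t - 1))
            head_out.append([
                sum(w * values[h][s][j] for w, s in zip(weights, positions))
                for j in range(d_v)
            ])
        outputs.append(head_out)
        maps.append(head_map)
    return outputs, maps


def kv_floats_per_token(num_heads, head_dim, scheme):
    """Cached key/value entries per token for each scheme."""
    if scheme == "mha":  # per-head keys and per-head values
        return num_heads * head_dim + num_heads * head_dim
    if scheme == "mqa":  # one shared key and one shared value
        return head_dim + head_dim
    if scheme == "ska":  # one shared key, per-head values
        return head_dim + num_heads * head_dim
    raise ValueError(f"unknown scheme: {scheme}")
```

With 8 heads of dimension 64, `kv_floats_per_token` gives 1024 cached entries per token for MHA, 576 for SKA, and 128 for MQA, which is one way to read the caption's point that SKA reduces the memory footprint while keeping head-specific values and a separate attention map per head.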

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Lemma 1
  • proof