
Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal

Abstract

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory into practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding-window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely to reduce memory during inference, but as a fundamental way to obtain more expressive models that scale better during pretraining.

Paper Structure

This paper contains 119 sections, 13 theorems, 50 equations, 21 figures, 22 tables.

Key Result

Theorem 2

With polynomial padding tokens, fixed-depth transformers with averaging-hard attention recognize exactly $\mathsf{FO}$-uniform $\mathsf{TC}^0$.

Figures (21)

  • Figure 1: Olmo Hybrid 7B is more efficient than Olmo 3 7B during pretraining, reaching the same Common Crawl loss in 35% fewer tokens and the same MMLU accuracy using 49% fewer tokens (and thus also 35% and 49% fewer FLOPs, respectively).
  • Figure 2: Compute-performance tradeoff of open-weight hybrid, RNN, and transformer base models on the average of OlmoBaseEval task suites. Olmo Hybrid 7B is on the Pareto frontier of open-weight dense models. Training compute was estimated using the $C=6ND$ heuristic (kaplan2020scalinglaws), with reported token and parameter counts. Per-benchmark results are reported in Table \ref{tab:base-eval}. While theoretical compute for MoE-based SSMs is low, effective training efficiency can be roughly 50--80% of a dense model (rajbhandari2022deepspeed), so we draw our frontier over dense models only. For this plot, we show only models obtaining >50% average performance on OlmoBaseEval.
  • Figure 3: Expressive power of transformers, linear RNNs, and hybrid models relative to circuit complexity classes. Dashed lines represent unproven but conjectured separations between classes (e.g., $\mathsf{TC}^0 \neq \mathsf{NC}^1$). Notably, transformers can express recall (\ref{sec:transformer-expressivity}), and DeltaNet (or GDN) with negative eigenvalues can express state tracking (\ref{sec:rnn-expressivity}; grazzi2025unlocking). Hybridizing gives both capabilities, and we prove that it also unlocks state-based recall, a problem that neither model can express on its own (\ref{sec:hybrid-expressivity}). With the addition of padding tokens, transformers can express exactly the class $\mathsf{TC}^0$, whereas hybrid models can capture all of $\mathsf{NC}^1$, which enables solving boolean formula evaluation (\ref{sec:expressivity-exact}).
  • Figure 4: Code evaluation contexts where predicting the next token requires solving state tracking (left) and recall (right). As the number of lines $n$ grows, fixed-depth transformers cannot represent state tracking, assuming $\mathsf{TC}^0 \neq \mathsf{NC}^1$ (merrill2024illusion). As the number of bits $m$ grows, RNNs with sub-linear precision cannot solve recall because of their bounded state (arora2024based; jelassi2024repeat). Hybrid models can represent both problems.
  • Figure 5: A code evaluation context where predicting the next token requires solving state-based recall. As $n$ increases, the task becomes inexpressible by transformers because the variable states cannot be tracked (assuming $\mathsf{TC}^0 \neq \mathsf{NC}^1$). As $m$ grows, the task becomes inexpressible by RNNs because recall into the bit array requires more memory than their bounded state can hold. There exists a simple hybrid model that can solve the task robustly for any value of $n$ and $m$.
  • ...and 16 more figures
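Figures 4 and 5 describe code-evaluation contexts that combine state tracking with recall. The following is a minimal sketch of such a context (all names here are hypothetical, not from the paper's released code): $n$ swap lines force tracking a small permutation, standing in for the paper's state-tracking task, and the final line indexes an $m$-bit array with the tracked state, standing in for recall. Assumes $m \geq 5$.

```python
import random

def make_state_based_recall_program(n, m, seed=0):
    """Generate a toy program in the spirit of Figure 5 (hypothetical helper):
    n swap lines require tracking a 5-element permutation (state tracking),
    and the final line indexes an m-bit array with the tracked state (recall).
    Returns the program text and the value of its final variable."""
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(m)]
    lines = [f"bits = {bits}", "p = [0, 1, 2, 3, 4]"]
    p = [0, 1, 2, 3, 4]
    for _ in range(n):
        a, b = rng.sample(range(5), 2)  # one swap line of the program
        lines.append(f"p[{a}], p[{b}] = p[{b}], p[{a}]")
        p[a], p[b] = p[b], p[a]        # mirror the swap to know the answer
    stride = m // 5                     # spread the 5 possible states over the array
    lines.append(f"answer = bits[p[0] * {stride}]")
    return "\n".join(lines), bits[p[0] * stride]

# Predicting the final token ("answer") requires both capabilities at once.
src, expected = make_state_based_recall_program(n=8, m=20)
env = {}
exec(src, env)
assert env["answer"] == expected
```

As $n$ grows, the permutation must be composed across many lines; as $m$ grows, the array lookup exceeds any bounded recurrent state, matching the separation the figures illustrate.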

Theorems & Definitions (30)

  • Definition 1: GDN with Negative Eigenvalues (schlag2021lineartransformers; yang2025gdn; grazzi2025unlocking)
  • Definition 2: State-Based Recall (\ref{fig:pointer-based-recall})
  • Proof sketch
  • Theorem 2: Padded Transformers (merrill2025exact)
  • Proof sketch
  • Corollary 3.1: Boolean Formula Evaluation Separation
  • Lemma 1
  • Proof
  • Lemma 2
  • Proof
  • ...and 20 more