Rational Transductors

Mehryar Mohri

TL;DR

The paper introduces Rational Transductors, a hybrid architecture that combines Transformer-style semantic modeling with a linear, matrix-valued recurrence derived from Weighted Finite Automata (WFA) to provide robust sequential state tracking. By injecting rational state information into the attention stream through Deep Rational Injection, the model extends expressivity to all Regular Languages and $\mathsf{NC}^1$-complete problems while preserving $O(L + \log T)$ parallel depth, addressing the Regular Gap that limits standard Transformers. Theoretical results show that Random Rational Features form a universal basis for sequential dependencies and that differentiable rational features are necessary to close the representational compactness gap, with a Krohn-Rhodes-based decomposition providing an algebraic foundation. Empirically, Rational Transductors demonstrate length generalization and algorithmic generalization on parity, modular counting, and long-integer addition tasks, while maintaining scalable training and inference via parallel scans. Together, these results position Rational Transductors as a minimal, algebraically complete extension of finite-depth Transformers for reliable, long-horizon sequential reasoning with practical efficiency.

Abstract

Standard Transformers excel at semantic modeling but struggle with rigid sequential logic and state tracking. Theoretical work establishes that self-attention is limited to $\mathsf{AC}^0$ (under hard attention) or $\mathsf{TC}^0$ (under soft attention), complexity classes that often fail to support robust length generalization on sequential problems without intermediate chain-of-thought. In this work, we introduce Rational Transductors, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence derived from Weighted Finite Automata (WFA). By injecting rational state information into the attention mechanism via a Deep Rational Injection scheme, our framework strictly generalizes the expressive power of Transformers to capture all Regular Languages, $\mathsf{NC}^1$-complete problems (such as Boolean Formula Evaluation), and fundamental separations like Parity and Modular Counting, while preserving $O(L + \log T)$ parallel time complexity. We ground the architecture in a rigorous learning theory: we prove that Random Rational Features act as a universal basis for sequential dependencies, justifying our initialization strategy, while establishing that the Differentiable Rational Feature regime is necessary to close the representational compactness gap. Theoretical analysis and empirical results demonstrate that Rational Transductors solve the "Regular Gap," enabling robust length generalization on algorithmic tasks where standard Transformers fail, without the sequential computational bottlenecks of traditional RNNs.
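
The $\log T$ term in the parallel time bound can be illustrated with a small sketch. It assumes the recurrence has the form $h_t = M_{x_t} h_{t-1}$, with one transition matrix per input symbol; the toy alphabet, the matrices, and all function names below are hypothetical and are not taken from the paper. Because matrix multiplication is associative, all prefix products can be formed by recursive doubling in about $\log_2 T$ rounds, where every position in a round could be processed in parallel.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def sequential_states(mats, h0):
    """Reference computation: h_t = M_t h_{t-1}, one step after another."""
    states, h = [], h0
    for m in mats:
        h = matvec(m, h)
        states.append(h)
    return states


def scan_prefix_products(mats):
    """Inclusive prefix products P_t = M_t ... M_1 by recursive doubling.

    Within one round every position is independent of the others, so a
    parallel machine needs only as many steps as there are rounds.
    """
    prefix = list(mats)
    offset, rounds = 1, 0
    while offset < len(prefix):
        prefix = [
            matmul(prefix[t], prefix[t - offset]) if t >= offset else prefix[t]
            for t in range(len(prefix))
        ]
        offset *= 2
        rounds += 1
    return prefix, rounds


# Hypothetical two-symbol alphabet with integer transition matrices.
TRANSITIONS = {"a": [[1, 1], [0, 1]], "b": [[0, 1], [1, 0]]}
word = "abbabaab"
h0 = [1, 0]

mats = [TRANSITIONS[ch] for ch in word]
prefix, rounds = scan_prefix_products(mats)
scan_states = [matvec(p, h0) for p in prefix]
print(rounds, scan_states == sequential_states(mats, h0))
```

For the 8-symbol word above, the scan finishes in 3 rounds and reproduces the states of the step-by-step loop. A production implementation would use a work-efficient scan on accelerator hardware rather than this list-based version.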

Paper Structure

This paper contains 87 sections, 30 theorems, 50 equations, 11 figures, and 2 tables.

Key Result

Proposition 1

A cascade of $K$ linear Weighted Finite Automata, where the state of automaton $k$ depends linearly on the state of automaton $k-1$, is algebraically equivalent to a single WFA with a larger state space dimension $d_{\text{total}} = \sum_{k=1}^K d_k$.
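
A numerical check of the $K = 2$ case may make the statement concrete. This is a sketch under an assumed coupling, not the paper's construction: the first automaton updates as $h^{(1)}_t = A^{(1)}_x h^{(1)}_{t-1}$ and the second as $h^{(2)}_t = A^{(2)}_x h^{(2)}_{t-1} + B_x h^{(1)}_{t-1}$. Stacking the two states gives a single linear recurrence with a block lower-triangular transition matrix of size $d_1 + d_2$. All matrices and names below are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def block_lower(a1, b, a2):
    """Assemble [[A1, 0], [B, A2]] as one (d1 + d2) x (d1 + d2) matrix."""
    d2 = len(a2)
    top = [row + [0] * d2 for row in a1]
    bottom = [b[i] + a2[i] for i in range(d2)]
    return top + bottom


def run_cascade(word, A1, B, A2, h1, h2):
    """Two coupled automata; both updates read the previous states."""
    for x in word:
        coupled = [p + q for p, q in zip(matvec(A2[x], h2), matvec(B[x], h1))]
        h1, h2 = matvec(A1[x], h1), coupled
    return h1 + h2  # concatenated state of length d1 + d2


def run_merged(word, M, h):
    """Single automaton acting on the stacked state."""
    for x in word:
        h = matvec(M[x], h)
    return h


# Hypothetical toy cascade with d1 = 2 and d2 = 1.
A1 = {"0": [[1, 0], [0, 1]], "1": [[0, 1], [1, 0]]}
A2 = {"0": [[1]], "1": [[1]]}
B = {"0": [[0, 0]], "1": [[0, 1]]}
M = {x: block_lower(A1[x], B[x], A2[x]) for x in "01"}

word = "1101001110"
cascade_state = run_cascade(word, A1, B, A2, [1, 0], [0])
merged_state = run_merged(word, M, [1, 0, 0])
print(cascade_state == merged_state, len(M["0"]))
```

The merged automaton has state dimension $d_1 + d_2 = 3$ and produces the same final state as the cascade, which is the $K = 2$ instance of $d_{\text{total}} = \sum_k d_k$ under the assumed coupling.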

Figures (11)

  • Figure 1: Visualizing the Rational State Update. The hidden state vector $\mathbf{h}_t$ (right) is computed as a linear transformation of the previous state $\mathbf{h}_{t-1}$ (left). Each component $h_{t,i}$ aggregates the weighted paths from the previous step, illustrating the "sum of paths" definition.
  • Figure 2: The Rational Transductor Architecture. The Rational Head extracts state variables $\mathbf{h}_t$. These states are injected into the Attention Stream via layer-specific projections $\mathsf{W}^{(l)}$, augmenting the semantic hidden states $\mathbf{z}_t^{(l)}$.
  • Figure 3: The Universal Rational Transductor. The architecture instantiates parallel heads with distinct dynamical biases: Orthogonal (top) for infinite memory and Stochastic (bottom) for discrete switching. These independent features are concatenated, corresponding to the direct sum ($\oplus$) of the underlying automata.
  • Figure 4: Architectural Comparison. (a) Wide Recurrence: The Rational Transductor computes a single high-dimensional state $\mathbf{h}_t$ directly from the input via a parallel scan, injecting it into all layers. (b) Deep Recurrence: Stacked architectures (e.g., H3, Mamba) interleave recurrence, where Layer $k$ depends on the output of Layer $k-1$, reintroducing a sequential bottleneck during training.
  • Figure 5: State tracking mechanisms for exact regular languages. (a) The Parity WFA uses a 2-state flip mechanism to track $L_{\text{parity}}$. (b) The Modulo-3 WFA generalizes this to a cyclic group structure to solve $L_k$ for $k=3$. Input '0' acts as the Identity $\mathsf{I}$ (self-loop), while input '1' acts as a permutation.
  • ...and 6 more figures
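
The state-tracking mechanism described for Figure 5 can be sketched in a few lines. The sketch assumes a one-hot state vector over $k$ residues, with input '0' mapped to the identity matrix and input '1' mapped to a cyclic-shift permutation, so the state records the number of ones modulo $k$. The function names are hypothetical, and the choice of residue 0 as the start state is an assumption rather than something the caption specifies.

```python
def identity(k):
    """k x k identity matrix: input '0' leaves the state unchanged."""
    return [[1 if r == c else 0 for c in range(k)] for r in range(k)]


def cyclic_shift(k):
    """k x k permutation matrix sending residue i to (i + 1) mod k."""
    return [[1 if r == (c + 1) % k else 0 for c in range(k)] for r in range(k)]


def count_ones_mod_k(bits, k):
    """Run the k-state automaton on a bit string and return the residue."""
    transitions = {"0": identity(k), "1": cyclic_shift(k)}
    state = [1] + [0] * (k - 1)  # one-hot start at residue 0 (assumed)
    for b in bits:
        m = transitions[b]
        state = [sum(m[r][c] * state[c] for c in range(k)) for r in range(k)]
    return state.index(1)


print(count_ones_mod_k("1011", 2))    # three ones, so parity is 1
print(count_ones_mod_k("110111", 3))  # five ones, so the residue is 2
```

With $k = 2$ the shift matrix is the 2-state flip of the Parity WFA, and $k = 3$ gives the cyclic structure of the Modulo-3 WFA. The state stays exactly one-hot at every step, so the result does not degrade as the input grows, which is the length-generalization behavior the summary attributes to these automata.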

Theorems & Definitions (59)

  • Proposition 1: Reducibility of Cascaded WFAs
  • Proof
  • Theorem 2: Cascaded Parameter Efficiency
  • Theorem 3: Non-Linear Irreducibility
  • Proof
  • Lemma 4: Positional Encodings are Rational
  • Proof
  • Theorem 5: The Parity Gap
  • Proof
  • Theorem 6: Exact Modular Counting
  • ...and 49 more