Rational Transductors
Mehryar Mohri
TL;DR
The paper introduces Rational Transductors (RTs), a hybrid architecture that pairs Transformer-style semantic modeling with a linear, matrix-valued recurrence derived from Weighted Finite Automata (WFA) to provide robust sequential state tracking. By injecting rational state information into the attention stream through Deep Rational Injection, the model extends expressivity to all Regular Languages and $\NC^1$-complete problems while preserving $O(L + \log T)$ parallel depth, closing the Regular Gap that limits standard Transformers. Theoretical results show that Random Rational Features form a universal basis for sequential dependencies and that the Differentiable Rational Feature regime is necessary to close the representational compactness gap, with a Krohn-Rhodes-based decomposition providing an algebraic foundation. Empirically, Rational Transductors demonstrate length generalization on parity, modulo counting, and long-integer addition tasks, while maintaining scalable training and inference via parallel scans. Together, these results position RTs as a minimal, algebraically complete extension of finite-depth Transformers for reliable, long-horizon sequential reasoning with practical efficiency.
Abstract
Standard Transformers excel at semantic modeling but struggle with rigid sequential logic and state tracking. Theoretical work establishes that self-attention is limited to $\AC^0$ (under hard attention) or $\TC^0$ (under soft attention), complexity classes that often fail to support robust length generalization on sequential problems without intermediate chain-of-thought. In this work, we introduce \emph{Rational Transductors}, a dual-stream architecture that augments the Transformer with a matrix-valued recurrence derived from Weighted Finite Automata (WFA). By injecting rational state information into the attention mechanism via a \emph{Deep Rational Injection} scheme, our framework strictly generalizes the expressive power of Transformers to capture all Regular Languages, $\NC^1$-complete problems (such as Boolean Formula Evaluation), and fundamental separations like Parity and Modular Counting, while preserving $O(L + \log T)$ parallel time complexity. We ground the architecture in a rigorous learning theory: we prove that \emph{Random Rational Features} act as a universal basis for sequential dependencies, justifying our initialization strategy, while establishing that the \emph{Differentiable Rational Feature} regime is necessary to close the representational compactness gap. Theoretical analysis and empirical results demonstrate that Rational Transductors solve the "Regular Gap," enabling robust length generalization on algorithmic tasks where standard Transformers fail, without the sequential computational bottlenecks of traditional RNNs.
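To make the idea of a matrix-valued linear recurrence evaluated in logarithmic parallel depth concrete, the following is a minimal sketch, not the paper's implementation. It assumes a two-state WFA for Parity in which each input bit selects a $2 \times 2$ transition matrix, and it computes all prefix products with a Hillis-Steele scan, which needs about $\log_2 T$ rounds of independent matrix products. All names (`IDENTITY`, `SWAP`, `prefix_products`, `running_parity`) are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[int]]

# Assumed transition matrices of a two-state parity automaton:
# bit 0 keeps the state, bit 1 swaps the "even" and "odd" states.
IDENTITY: Matrix = [[1, 0], [0, 1]]
SWAP: Matrix = [[0, 1], [1, 0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def prefix_products(mats: List[Matrix]) -> List[Matrix]:
    """Inclusive scan: out[t] = mats[t] @ mats[t-1] @ ... @ mats[0].

    Each round combines every position with the one `step` places
    earlier. The products within a round are independent of each other,
    so a parallel machine needs only about log2(T) rounds.
    """
    out = list(mats)
    step = 1
    while step < len(out):
        out = [out[t] if t < step else matmul(out[t], out[t - step])
               for t in range(len(out))]
        step *= 2
    return out


def running_parity(bits: List[int]) -> List[int]:
    """Parity of every prefix of `bits`, read off the scanned state."""
    mats = [SWAP if b else IDENTITY for b in bits]
    # The initial state is the "even" basis vector [1, 0], so the state
    # after step t is the first column of the prefix product; its second
    # entry is 1 exactly when the prefix contains an odd number of ones.
    return [p[1][0] for p in prefix_products(mats)]


print(running_parity([1, 0, 1, 1]))  # → [1, 1, 0, 1]
```

Because matrix multiplication is associative, the same scan applies to any fixed-size transition matrices, which is what lets a linear recurrence of this kind avoid the step-by-step dependency of a traditional RNN.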
