Reactive Transformer (RxT) -- Stateful Real-Time Processing for Event-Driven Reactive Language Models
Adam Filipek
TL;DR
The paper addresses the inefficiency and latency of stateless LLMs in long, multi-turn dialogues by introducing the Reactive Transformer (RxT), an event-driven, stateful architecture that maintains a fixed-size Short-Term Memory (STM) and decouples response generation from memory updates through asynchronous processing. RxT restructures the Transformer flow into a cyclical interaction: a Generator-Decoder produces responses conditioned on the STM, while a Memory Encoder and a Memory Attention Network asynchronously update the STM with a representation of each interaction, so that user-facing cost grows linearly with the number of turns. A four-stage supervised training curriculum is proposed to stabilize learning across the generator, memory encoder, and memory attention components, addressing the cold-start problem and enabling effective end-to-end operation. Experimental results on synthetic data show RxT variants achieving lower perplexity, higher coherence rewards, and constant per-turn latency relative to a stateless baseline of comparable size, supporting the practicality and efficiency of stateful, event-driven dialogue processing.
Abstract
The Transformer architecture has become the de facto standard for Large Language Models (LLMs), demonstrating remarkable capabilities in language understanding and generation. However, its application in conversational AI is fundamentally constrained by its stateless nature and its quadratic computational complexity ($O(L^2)$) with respect to sequence length $L$. Current models emulate memory by reprocessing an ever-expanding conversation history with each turn, leading to prohibitive costs and latency in long dialogues. This paper introduces the Reactive Transformer (RxT), a novel architecture designed to overcome these limitations by shifting from a data-driven to an event-driven paradigm. RxT processes each conversational turn as a discrete event in real time, maintaining context in an integrated, fixed-size Short-Term Memory (STM) system. The architecture features a distinct operational cycle in which a generator-decoder produces a response based on the current query and the previous memory state, after which a memory-encoder and a dedicated Memory Attention network asynchronously update the STM with a representation of the complete interaction. This design fundamentally alters the scaling dynamics, reducing the total user-facing cost of a conversation from quadratic ($O(N^2 \cdot T)$) to linear ($O(N \cdot T)$) with respect to the number of interactions $N$. By decoupling response generation from memory updates, RxT achieves low latency, enabling truly real-time, stateful, and economically viable long-form conversations. We validated our architecture with a series of proof-of-concept experiments on synthetic data, demonstrating superior performance and constant-time inference latency compared to a baseline stateless model of comparable size.
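To make the scaling claim concrete, the following is a minimal sketch of the token-count arithmetic behind $O(N^2 \cdot T)$ versus $O(N \cdot T)$. It assumes, for illustration only, that every interaction contains exactly $T$ tokens, that a stateless model reprocesses the full history at each turn, and that the cost of reading the fixed-size STM is a constant that can be ignored. The function names are hypothetical and are not taken from the paper.

```python
def stateless_total_tokens(num_turns: int, tokens_per_turn: int) -> int:
    """Tokens processed by a stateless model over a whole conversation.

    At turn k the model re-reads all k interactions so far (k * T tokens),
    so the total is T * N * (N + 1) / 2, which grows as O(N^2 * T).
    """
    return sum(k * tokens_per_turn for k in range(1, num_turns + 1))


def reactive_total_tokens(num_turns: int, tokens_per_turn: int) -> int:
    """Tokens processed by an event-driven model with fixed-size memory.

    Each turn processes only the current interaction (T tokens); the
    fixed-size memory read is treated as a constant and omitted here
    (a simplifying assumption). The total is N * T, i.e. O(N * T).
    """
    return num_turns * tokens_per_turn


if __name__ == "__main__":
    T = 200  # assumed tokens per interaction (illustrative value)
    for n in (1, 10, 100):
        s = stateless_total_tokens(n, T)
        r = reactive_total_tokens(n, T)
        print(f"N={n:>3}  stateless={s:>9}  reactive={r:>6}  ratio={s / r:.1f}")
```

Under these assumptions the ratio between the two totals is $(N + 1)/2$, so the gap widens steadily as the conversation gets longer. Real workloads have variable turn lengths and non-zero memory overhead, so the numbers are indicative only.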
