Hydra: A Modular Architecture for Efficient Long-Context Reasoning
Siddharth Chaudhary, Dev Patel, Maheep Chaudhary, Bennett Browning
TL;DR
Transformers struggle with long-context reasoning because self-attention costs $O(L^2)$ in the sequence length $L$. Hydra addresses this by combining a Structured State Space Model backbone with adaptive routing to sparse global attention, mixture-of-experts, and dual memories (a reasoning workspace and product key memory, PKM), forming a modular, decoder-only architecture. On toy-scale benchmarks, Hydra achieves up to $3.0\times$ throughput gains at 8K tokens and up to $10\times$ higher multi-step reasoning accuracy than equal-sized transformers, with ablations confirming the distinct benefit of each module. The results suggest that modular efficiency mechanisms can jointly reduce compute cost and improve reasoning and recall, enabling practical long-context reasoning in resource-constrained settings.
Abstract
The quadratic complexity of transformers fundamentally limits the deployment of reasoning systems in resource-constrained and long-context settings. We introduce Hydra, a modular architecture built on a state-space backbone that adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and product key memory. We evaluate a 29M-parameter model, measuring logical chaining accuracy and throughput on synthetic sequences, plus throughput on WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\times$ and $3.0\times$ throughput gains at 8K tokens on the synthetic and WikiText datasets, respectively, and a $10\times$ accuracy improvement on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component's contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval.
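
As a rough illustration of the quadratic-versus-linear scaling that motivates the state-space backbone, the sketch below compares approximate operation counts for dense self-attention and a diagonal state-space recurrence. The cost formulas are standard back-of-the-envelope approximations, and the names and values (`d_model=256`, `state_dim=16`) are hypothetical choices for illustration, not Hydra's configuration. The ratio it prints covers only the sequence-mixing step, so it does not predict the measured end-to-end throughput gains, which also include projections, experts, and memory lookups.

```python
def attention_mixing_ops(seq_len: int, d_model: int) -> int:
    """Approximate multiply-adds for QK^T plus the weighted sum over V.

    Both products touch every (query, key) pair, hence the seq_len**2 term.
    """
    return 2 * seq_len * seq_len * d_model


def ssm_mixing_ops(seq_len: int, d_model: int, state_dim: int) -> int:
    """Approximate multiply-adds for a diagonal state-space recurrence.

    Assumes one state update and one readout per (channel, state) pair
    at each time step, so the cost grows linearly in seq_len.
    """
    return 2 * seq_len * d_model * state_dim


def cost_ratio(seq_len: int, d_model: int = 256, state_dim: int = 16) -> float:
    """Ratio of attention to state-space mixing cost (hypothetical sizes)."""
    return attention_mixing_ops(seq_len, d_model) / ssm_mixing_ops(
        seq_len, d_model, state_dim
    )


if __name__ == "__main__":
    for seq_len in (1024, 2048, 4096, 8192):
        print(f"L={seq_len:5d}  attention/SSM mixing cost ratio: {cost_ratio(seq_len):.0f}")
```

Under these assumptions the ratio simplifies to $L / N$ for state dimension $N$, so doubling the context length doubles the relative cost of dense attention.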
