Hydra: A Modular Architecture for Efficient Long-Context Reasoning

Siddharth Chaudhary; Dev Patel; Maheep Chaudhary; Bennett Browning

TL;DR

Transformers struggle with long-context reasoning because self-attention costs $O(L^2)$ in the sequence length $L$. Hydra addresses this by combining a Structured State Space Model backbone with adaptive routing to sparse global attention, mixture-of-experts, and dual memories (a Workspace and a product key memory, PKM), forming a modular, decoder-only architecture. On toy-scale benchmarks, Hydra achieves up to $3.0\times$ higher throughput at 8K tokens and up to $10\times$ higher multi-step reasoning accuracy, with ablations confirming the distinct benefit of each module. The results suggest that modular efficiency mechanisms can jointly reduce compute cost and improve reasoning and recall, enabling practical long-context reasoning in resource-constrained settings.
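
As a rough worked illustration of the scaling gap at the 8K-token setting, assume $L = 8192$ and count only the pairwise attention scores per head per layer against one state update per token for a linear-time recurrent scan. Constant factors and the other modules are ignored, so this is a sketch of the asymptotics rather than a cost model for Hydra:

$$
L^2 = 8192^2 = 67{,}108{,}864, \qquad \frac{L^2}{L} = L = 8192.
$$

This counts asymptotic terms only and is not a prediction of measured throughput; the gains reported here are the empirical figures.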

Abstract

The quadratic complexity of transformers fundamentally limits the deployment of reasoning systems in resource-constrained and long-context settings. We introduce Hydra, a modular architecture built on a state-space backbone that adaptively routes between complementary efficiency mechanisms: sparse global attention, mixture-of-experts, and dual memories comprising a reasoning workspace and a product key memory. We evaluate a 29M-parameter model on logical-chaining accuracy and throughput using synthetic sequences, and on throughput using WikiText. Ablation studies use component-specific synthetic datasets to isolate individual mechanisms. Hydra achieves $3.01\times$ and $3.0\times$ throughput gains at 8K tokens on the synthetic and WikiText datasets, respectively, and $10\times$ accuracy improvements on multi-step logical composition compared to equal-sized transformers. Ablations confirm each component's contribution: sparse attention captures long-range dependencies, experts specialize to input domains, and product key memory enables selective retrieval.
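
To make the "selective retrieval" role of a product key memory more concrete, here is a minimal sketch of a generic product-key lookup in plain Python. It is not Hydra's implementation: the function names (`pkm_lookup`, `top_k`, `dot`), the two-way query split, and the softmax weighting are assumptions chosen for illustration, and the paper's actual key sizes and value of `k` are not stated here. The idea shown is that a memory with `n * n` slots can be searched by scoring only two tables of `n` sub-keys and then `k * k` candidate slots.

```python
import heapq
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Return the k highest (score, index) pairs, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def pkm_lookup(query, sub_keys_1, sub_keys_2, values, k):
    """Hypothetical product-key lookup (illustrative, not Hydra's code).

    Slot (i, j) has the implicit key concat(sub_keys_1[i], sub_keys_2[j])
    and the value values[i * len(sub_keys_2) + j].
    Returns (weighted sum of the selected values, selected slot indices).
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]

    # Score each query half against its own small sub-key table.
    best_1 = top_k([dot(q1, key) for key in sub_keys_1], k)
    best_2 = top_k([dot(q2, key) for key in sub_keys_2], k)

    # Slot scores are additive, so the global top-k slots must lie
    # among these k * k candidates.
    n2 = len(sub_keys_2)
    candidates = [
        (s1 + s2, i * n2 + j) for s1, i in best_1 for s2, j in best_2
    ]
    selected = heapq.nlargest(k, candidates)

    # Softmax over the selected scores (shifted for numerical stability).
    max_score = selected[0][0]
    weights = [math.exp(s - max_score) for s, _ in selected]
    total = sum(weights)

    dim = len(values[0])
    output = [0.0] * dim
    for w, (_, slot) in zip(weights, selected):
        for d in range(dim):
            output[d] += (w / total) * values[slot][d]

    return output, [slot for _, slot in selected]
```

Calling `pkm_lookup(query, sub_keys_1, sub_keys_2, values, k)` with two tables of `n` sub-keys and `n * n` value vectors returns the blended value and the indices of the `k` slots that were read, which is one simple way to inspect which memory entries a query selects.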
