ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

Haochen You, Baojing Liu

TL;DR

ReSSFormer introduces a Recursive Sparse Structured Transformer that replaces deep stacking with a recurrent reasoning unit (R2MU), employs token- and expert-level sparse attention via ASAM, and eliminates fixed positional encodings through a self-organizing encoder (SOES) that learns latent token graphs. The model operates over $K$ iterative steps, maintaining a memory that combines a token-level cache and a segment-level summary, updated by a learnable gating mechanism to support multi-step reasoning with sublinear parameter growth. ASAM narrows attention through sparsity in activations, top-$k$ key selection, and Mixture-of-Experts routing to scale efficiently to long contexts. SOES enables structure-aware processing by deriving a content-dependent graph with a smooth evolution penalty, allowing robust handling of unordered data, tables, and graphs. Experiments across long-context QA, language modeling, and structure-sensitive tasks show ReSSFormer achieves superior accuracy and efficiency under comparable FLOPs and parameter budgets, with strong robustness to noise and without relying on positional priors.
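The recurrent, gated-memory idea described above can be illustrated with a minimal sketch. This is not the authors' R2MU: the names `gated_memory_update`, `recurrent_reasoning`, `step_fn`, and `gate_logits` are hypothetical, the memory is reduced to a single flat vector rather than a token-level cache plus a segment-level summary, and the gate logits are fixed inputs where the paper describes a learnable gate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_update(memory, candidate, gate_logits):
    """Blend old memory with a candidate: m' = g * candidate + (1 - g) * memory."""
    updated = []
    for m, c, z in zip(memory, candidate, gate_logits):
        g = sigmoid(z)  # gate in (0, 1); assumed fixed here, learnable in a real model
        updated.append(g * c + (1.0 - g) * m)
    return updated


def recurrent_reasoning(state, memory, step_fn, gate_logits, num_steps):
    """Apply one shared block num_steps times instead of stacking distinct layers."""
    for _ in range(num_steps):
        # The same step_fn (same parameters) is reused at every iteration.
        state = step_fn(state, memory)
        # The new state acts as the candidate written into memory.
        memory = gated_memory_update(memory, state, gate_logits)
    return state, memory
```

Because `step_fn` is shared across all `num_steps` iterations, increasing the number of reasoning steps adds computation without adding parameters, which is the intuition behind replacing depth stacking with recurrence.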

Abstract

While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization, largely due to rigid layer stacking, dense attention, and reliance on positional encodings. We present ReSSFormer, a Recursive Sparse Structured Transformer that integrates three complementary innovations: the Recurrent Reasoning & Memory Unit (R2MU) for iterative reasoning with bounded depth, the Adaptive Sparse Attention Module (ASAM) for efficient and focused context selection, and the Self-Organizing Encoder Structure (SOES) for position-free structure induction. ReSSFormer replaces conventional depth stacking with recurrent inference, substitutes full attention with token- and expert-level sparsity, and models latent token topology directly from content. Across language modeling, multi-hop QA, and structure-sensitive tasks, ReSSFormer consistently outperforms strong baselines under comparable FLOPs and parameter budgets, highlighting its scalability, efficiency, and structural flexibility.
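To make token-level sparsity concrete, here is a minimal sketch of generic top-$k$ key selection for a single query. It assumes standard scaled dot-product scoring and is not the authors' ASAM: the function name `topk_sparse_attention` is hypothetical, and sparse activations and Mixture-of-Experts routing are omitted.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys."""
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    # Indices of the k largest scores; all other keys are ignored.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only (shifted by the max for stability).
    top = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    output = [sum(weights[i] * values[i][j] for i in keep) for j in range(dim)]
    return output, weights
```

This sketch still scores every key before selecting, so it only shows the selection and renormalization steps. The value aggregation touches just `k` entries, regardless of context length.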

Paper Structure

This paper contains 15 sections, 8 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Accuracy vs. input length.
  • Figure 2: Accuracy across structure-sensitive tasks.
  • Figure 3: Noise robustness comparison on NarrativeQA and HotpotQA. Bars indicate accuracy with distractor paragraphs; lines show degradation trends across model variants.