Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?
Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano
TL;DR
The paper addresses the brittleness of large language models in multi-step reasoning over long contexts by introducing MemReasoner, a memory-augmented architecture that learns the temporal order of facts and supports iterative reads that update the query. Built on the Larimar framework, MemReasoner adds a temporal memory and an inference-time update to handle long haystack-style inputs, achieving robust generalization on synthetic 1-hop and 2-hop tasks with minimal supporting-fact supervision. Experiments on bAbI and Variable Tracking show that MemReasoner outperforms memory-based baselines, particularly as context length grows and when only a small fraction of supporting facts is available. Overall, the work suggests that explicit latent memory and weak supervision together improve context processing for reasoning in language models, with potential implications for more reliable real-world reasoning tasks.
Abstract
Large language models often expose their brittleness in reasoning tasks, especially when executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in the test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language models' context processing ability toward reasoning tasks.
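To make the idea of order-aware memory and hopping concrete, the following is a toy, purely symbolic sketch of a two-hop lookup over an ordered list of facts. It is not the MemReasoner implementation: the model described above operates on learned latent encodings, whereas this illustration uses plain token matching. The helper names `read` and `answer_two_hop`, the example facts, and the convention that the subject is the first word and the location the last word are all assumptions made for illustration.

```python
def read(memory, query_terms):
    """Return the most recent fact containing every query term, or None.

    Scanning from the end of the list stands in for a memory that knows
    the relative order of facts (an illustrative assumption).
    """
    for fact in reversed(memory):
        if query_terms <= set(fact.split()):
            return fact
    return None


def answer_two_hop(memory, obj):
    """Answer 'where is <obj>?' with two reads, updating the query in between."""
    # Hop 1: find the latest fact that mentions the object.
    holder_fact = read(memory, {obj})
    if holder_fact is None:
        return None
    # Query update: the subject of that fact becomes the new query.
    holder = holder_fact.split()[0]
    # Hop 2: find the latest movement fact about that subject.
    location_fact = read(memory, {holder, "went"})
    return location_fact.split()[-1] if location_fact else None


# Hypothetical bAbI-style context with one distractor sentence.
facts = [
    "mary went to the garden",
    "john went to the office",
    "mary picked up the football",
    "the sky was grey that day",
    "mary went to the kitchen",
    "john went to the hallway",
]

print(answer_two_hop(facts, "football"))  # kitchen
```

In this toy setting, padding the context with unrelated sentences does not change the answer, since each read depends only on the matching facts and their relative order. That is the kind of robustness to long distractor text that the experiments described above test for in the learned model.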
