Reuse, Don't Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration
Daivik Patel, Shrenik Patel
TL;DR
The paper tackles the high cost of test-time reasoning in large reasoning models by introducing ENGRAM-R, an inference-time memory layer that sits outside the model weights and uses typed episodic, semantic, and procedural memory to retrieve compact, auditable evidence. Retrieved records are rendered as compact Fact Cards and used under explicit citation constraints, which bounds reasoning length and makes justifications traceable. On LoCoMo and LongMemEval$_\text{S}$, ENGRAM-R delivers order-of-magnitude reductions in input tokens and substantial reductions in reasoning tokens while maintaining or improving accuracy, especially for multi-hop and temporal reasoning. The approach demonstrates practical compute savings for real-world deployments and generalizes to domains like healthcare (HealthBench), highlighting memory reuse as a viable lever for efficient reasoning under tight budgets.
Abstract
Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.
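To make the pipeline described above concrete, here is a minimal Python sketch of typed retrieval, fact card rendering, and a citation check. It is an illustration under stated assumptions, not the ENGRAM-R implementation: the names `Record`, `retrieve`, `render_fact_cards`, `build_prompt`, and `cited_ids_valid` are hypothetical, the card layout is invented for the example, and plain word overlap stands in for whatever retriever the system actually uses.

```python
import re
from dataclasses import dataclass

# The three memory types named in the summary above.
MEMORY_TYPES = ("episodic", "semantic", "procedural")


@dataclass(frozen=True)
class Record:
    rid: str    # hypothetical card ID, e.g. "E1"
    mtype: str  # one of MEMORY_TYPES
    text: str   # the stored evidence


def score(query: str, record: Record) -> int:
    """Assumed stand-in scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(record.text.lower().split()))


def retrieve(query: str, store: list[Record], k_per_type: int = 1) -> list[Record]:
    """Typed retrieval: keep the top-k matching records from each memory type."""
    hits: list[Record] = []
    for mtype in MEMORY_TYPES:
        typed = [r for r in store if r.mtype == mtype]
        ranked = sorted(typed, key=lambda r: score(query, r), reverse=True)
        hits.extend(r for r in ranked[:k_per_type] if score(query, r) > 0)
    return hits


def render_fact_cards(records: list[Record]) -> str:
    """Render each record as one compact, ID-tagged line (assumed layout)."""
    return "\n".join(f"[{r.rid}] ({r.mtype}) {r.text}" for r in records)


def build_prompt(query: str, records: list[Record]) -> str:
    """Build a short prompt from the cards instead of the full history."""
    return (
        "Answer using only the fact cards below. "
        "Cite the IDs of the cards you rely on.\n"
        f"{render_fact_cards(records)}\n"
        f"Question: {query}"
    )


def cited_ids_valid(answer: str, records: list[Record]) -> bool:
    """Citation check: at least one ID is cited and every cited ID was supplied."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    allowed = {r.rid for r in records}
    return bool(cited) and cited <= allowed
```

In this sketch the prompt length depends on the number of retrieved cards rather than on the length of the conversation history, which is the intuition behind the input-token savings reported above. The citation check rejects answers that cite nothing or cite a card that was never supplied, so each accepted answer can be traced back to specific records.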
