Sleep-time Compute: Beyond Inference Scaling at Test-time

Kevin Lin; Charlie Snell; Yu Wang; Charles Packer; Sarah Wooders; Ion Stoica; Joseph E. Gonzalez

TL;DR

Sleep-time compute lets a model reason offline about a given context and produce a re-represented context, which reduces the test-time compute needed while preserving accuracy. The paper formalizes this context-preprocessing phase and evaluates it on two stateful reasoning datasets (Stateful GSM-Symbolic and Stateful AIME) plus a Multi-Query variant of GSM-Symbolic. The approach yields Pareto improvements in the latency-accuracy trade-off, and accuracy rises substantially as sleep-time compute scales. A cost-aware analysis shows up to 2.5× lower average cost per query when multiple related queries share a context, and a case study on an agentic software engineering (SWE) task demonstrates practical benefits. Overall, sleep-time compute expands the latency-accuracy frontier for LLMs by exploiting context that is available before user queries arrive, and it offers a pathway to scalable, stateful reasoning in real-world applications.

Abstract

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but it comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks: Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study of applying sleep-time compute to a realistic agentic SWE task.
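
The amortization argument in the abstract can be illustrated with a small back-of-the-envelope sketch. The function below is not the paper's cost model: the name `avg_cost_per_query`, its parameters, the per-token cost weights, and every token count in the example are hypothetical values chosen only to show how a one-time sleep-time cost is spread over several queries that share a context.

```python
def avg_cost_per_query(
    sleep_tokens: float,
    test_tokens_per_query: float,
    num_queries: int,
    sleep_token_cost: float = 1.0,  # assumed relative price of an offline token
    test_token_cost: float = 1.0,   # assumed relative price of a test-time token
) -> float:
    """Average cost per query when one sleep-time pass serves num_queries queries.

    Illustrative assumption: sleep-time compute is paid once per context,
    while test-time compute is paid again for every query.
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    total = (
        sleep_tokens * sleep_token_cost
        + num_queries * test_tokens_per_query * test_token_cost
    )
    return total / num_queries


# Hypothetical token budgets, not measurements from the paper.
baseline = avg_cost_per_query(0, 1000, 10)          # no sleep-time pass
single_query = avg_cost_per_query(3000, 250, 1)     # sleep-time cost paid for one query
amortized = avg_cost_per_query(3000, 250, 10)       # same cost spread over ten queries

print(baseline, single_query, amortized)
```

With these made-up numbers, the sleep-time pass costs more than the baseline when only one query is asked about the context, and less once the same pass is shared by enough queries. Lowering `sleep_token_cost` relative to `test_token_cost` would model a setting in which offline tokens are cheaper than latency-sensitive ones.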
