Sleep-time Compute: Beyond Inference Scaling at Test-time

Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez

TL;DR

Sleep-time compute introduces offline reasoning on a given context to produce a re-represented context that reduces test-time compute while preserving accuracy. By formalizing a context-preprocessing phase and evaluating on stateful reasoning datasets (Stateful GSM-Symbolic, Stateful AIME) plus a Multi-Query variant, the approach yields Pareto improvements in latency vs. accuracy and substantial accuracy gains as sleep-time compute scales. A cost-aware analysis shows up to 2.5× per-query savings when multiple related queries share a context, and a case study on SWE demonstrates practical benefits in agentic software engineering tasks. Overall, sleep-time compute expands the latency-accuracy frontier for LLMs by exploiting available context before user queries and offers a pathway for scalable, stateful reasoning in real-world applications.
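The two phases described above (offline context re-representation, then online answering) can be sketched as a small cache around a model call. This is only an illustrative sketch, not the authors' implementation: the `LLM` callable type, the `SleepTimeCache` class, and the prompt wording are all hypothetical, and any function that maps a prompt string to a completion string could be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class SleepTimeCache:
    """Stores re-represented contexts, keyed by the raw context (illustrative only)."""

    llm: LLM
    _store: Dict[str, str] = field(default_factory=dict)

    def prepare(self, context: str) -> str:
        """Offline phase: runs once per context, before any query arrives."""
        if context not in self._store:
            # Assumed prompt wording; the paper's actual prompts are not reproduced here.
            prompt = (
                "Read the context and write down inferences that are likely "
                "to be useful for future questions.\n\n" + context
            )
            notes = self.llm(prompt)
            self._store[context] = context + "\n\nNotes:\n" + notes
        return self._store[context]

    def answer(self, context: str, query: str) -> str:
        """Online phase: answers a query against the prepared context."""
        prepared = self.prepare(context)  # cache hit if prepare() already ran
        return self.llm(prepared + "\n\nQuestion: " + query)
```

In this sketch, two queries about the same context trigger one offline call and two online calls, which is the sharing pattern the Multi-Query setting relies on.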

Abstract

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks: Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study of applying sleep-time compute to a realistic agentic SWE task.
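The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a simple linear cost model in which the one-off sleep-time tokens are shared across all queries on a context and test-time tokens may be priced at a multiple of sleep-time tokens. The function name, the `test_price_ratio` parameter, and every number in the usage note are illustrative assumptions, not figures from the paper.

```python
def avg_cost_per_query(
    sleep_tokens: float,
    test_tokens_per_query: float,
    n_queries: int,
    test_price_ratio: float = 1.0,
) -> float:
    """Average cost per query, in units of one sleep-time token.

    Assumed model: the sleep-time cost is paid once per context, while
    each query pays its own test-time cost, weighted by test_price_ratio.
    """
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    total = sleep_tokens + test_price_ratio * test_tokens_per_query * n_queries
    return total / n_queries


# Hypothetical numbers, chosen only to show the shape of the trade-off.
baseline = avg_cost_per_query(0, 1000, n_queries=10, test_price_ratio=10)
with_sleep = avg_cost_per_query(4000, 200, n_queries=10, test_price_ratio=10)
single_query = avg_cost_per_query(4000, 200, n_queries=1, test_price_ratio=10)
```

Under these made-up inputs, `baseline` is 10000.0 and `with_sleep` is 2400.0, while `single_query` is 6000.0: the per-query cost of sleep-time compute falls as more queries share the same context.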

Paper Structure

This paper contains 45 sections, 26 figures, 1 table.

Figures (26)

  • Figure 1: Example of applying sleep-time compute on Multi-Query GSM-Symbolic-P1. Sleep-time compute processes the original raw context, adding computations that may be useful for future queries. Moreover, contexts can be shared across related queries, enabling savings in total cost per query.
  • Figure 2: Example of separating an instance from GSM-Symbolic into a context and a question, creating an instance in Stateful GSM-Symbolic.
  • Figure 3: The test-time compute vs. accuracy tradeoff on Stateful GSM-Symbolic. The shaded area indicates where sleep-time compute improves the Pareto frontier of test-time compute vs. accuracy.
  • Figure 4: The test-time compute vs. accuracy tradeoff on Stateful AIME for various reasoning models. Applying sleep-time compute allows models to reach similar levels of performance with much less compute at test-time. The shaded area indicates the Pareto improvement from sleep-time compute.
  • Figure 5: Comparing test-time scaling with sleep-time compute against parallel test-time scaling with pass@k on Stateful GSM-Symbolic. Sleep-time compute generally Pareto-dominates pass@k.
  • ...and 21 more figures