Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Junhyuck Kim, Ethan Ewer, Taehong Moon, Jongho Park, Dimitris Papailiopoulos

TL;DR

This study shows that memory optimization for reasoning models is scale-dependent, with KV cache often dominating total memory. Through a comprehensive empirical sweep across 1,700 configurations on the Qwen3 family using AIME25 and GPQA-Diamond, the authors reveal a threshold around an effective size of 8-bit 4B, below which increasing model capacity yields better memory efficiency and above which extending generation length (test-time compute) is more memory-efficient. They compare cache eviction and cache quantization, finding eviction preferable for small models and quantization competitive for large ones, and demonstrate that parallel scaling only helps for larger models. The work provides principled deployment guidelines: for small reasoning models, prioritize model capacity (weights) over long generation, while for larger models, maximize test-time compute and parallel sampling, highlighting that memory optimization for reasoning models cannot follow a universal prescription. Overall, the paper reframes inference-time optimization by explicitly balancing weight precision, cache strategy, and token budgets within fixed memory budgets, guiding more effective deployment of reasoning-enabled LLMs.

Abstract

While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than model size can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to more weights rather than longer generation, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.
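
The memory accounting behind this trade-off can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a standard attention KV cache (one key and one value vector per layer, per KV head, per token) and no prefix sharing between parallel samples. The `ModelSpec` fields and the example values are hypothetical and are not taken from any Qwen3 configuration. Reading "8-bit 4B" as a weight footprint of about 4 GB is likewise an interpretation of the abstract's phrasing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of a decoder-only model (illustrative fields)."""
    num_params: float   # total parameter count
    weight_bits: int    # bits per weight after quantization (e.g. 4, 8, 16)
    num_layers: int     # number of transformer layers
    num_kv_heads: int   # number of key/value heads per layer
    head_dim: int       # dimension of each head
    kv_bits: int = 16   # bits per cached key/value element


def weight_bytes(spec: ModelSpec) -> float:
    """Memory taken by the weights: parameters times bits per weight."""
    return spec.num_params * spec.weight_bits / 8


def kv_cache_bytes(spec: ModelSpec, num_tokens: int, group_size: int = 1) -> float:
    """Memory taken by the KV cache for `group_size` sequences of `num_tokens`.

    Assumes one key and one value vector per layer, KV head, and token,
    and that parallel samples do not share any cached prefix.
    """
    per_token = 2 * spec.num_layers * spec.num_kv_heads * spec.head_dim * spec.kv_bits / 8
    return per_token * num_tokens * group_size


def total_memory_gb(spec: ModelSpec, num_tokens: int, group_size: int = 1) -> float:
    """Total memory (weights + KV cache) in GB, using 1 GB = 1e9 bytes."""
    return (weight_bytes(spec) + kv_cache_bytes(spec, num_tokens, group_size)) / 1e9


# Illustrative values only: a 4B-parameter model with 8-bit weights.
spec = ModelSpec(num_params=4e9, weight_bits=8, num_layers=36,
                 num_kv_heads=8, head_dim=128)
print(f"weights: {weight_bytes(spec) / 1e9:.2f} GB")
print(f"KV cache at 30,000 tokens: {kv_cache_bytes(spec, 30_000) / 1e9:.2f} GB")
```

With these placeholder values, the cache for a single 30,000-token generation is already larger than the weights, which is the regime the abstract describes as "the KV cache rather than model size can dominate memory."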

Paper Structure

This paper contains 33 sections, 3 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Memory vs. Accuracy for serial test-time scaling on AIME25. The plot illustrates the trade-off between pass@1 accuracy and total memory (weights + KV cache) for the Qwen3 family. Model weights are quantized to 4- and 8-bit using GPTQ. Along each curve, the KV cache grows as the generation length increases via budget forcing. For models effectively smaller than an 8-bit 4B, increasing the token budget to saturation is memory-inefficient. Furthermore, for mathematical reasoning, higher weight precision (8- and 16-bit) proves more memory-efficient than 4-bit.
  • Figure 2: Composition of Pareto-optimal configurations (AIME25, Qwen3). The token budget (a) and effective model size (b) are plotted against the total memory budget for configurations on the Pareto frontier from Figure 1. The plots illustrate a strategic shift: at lower memory budgets (<10 GB), increasing effective model size is memory-efficient, whereas at higher budgets, increasing the token budget becomes the dominant strategy for improving performance.
  • Figure 3: Memory vs. Accuracy under different theoretical batch sizes (AIME25, Qwen3). Each subplot shows memory-per-generation vs. accuracy for different theoretical batch sizes, where model weight memory is amortized across concurrent generations. The Pareto frontier shifts as batch size increases, revealing how model weight amortization affects the optimal memory allocation strategy.
  • Figure 4: Memory vs. Accuracy on GPQA-Diamond (Qwen3). The memory-accuracy trade-off for serial scaling on GPQA-Diamond. Total memory is the sum of model weights and KV cache. Points along each curve represent increasing token budgets. 4-bit weights are broadly memory-optimal for knowledge-intensive tasks.
  • Figure 5: Effect of parallel scaling on the Pareto frontier (Qwen3). Each colored curve represents the Pareto frontier for a specific model size and weight precision, obtained by increasing the sampling group size, $G$. The Pareto frontier for serial scaling ($G=1$) across all models is shown as a dotted line. Parallel scaling is only effective for large models.
  • ...and 7 more figures
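
Two ideas that recur in the captions above, the Pareto frontier over (memory, accuracy) points and the amortization of weight memory across a batch, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's analysis code: the function names are hypothetical, the sample points are arbitrary placeholders rather than measured results, and the per-generation formula (weights divided by batch size, plus one sequence's KV cache) is one plausible reading of "model weight memory is amortized across concurrent generations."

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (memory in GB, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point.

    A point is kept if no other point has memory less than or equal to
    its own while achieving accuracy at least as high. The result is
    sorted by increasing memory.
    """
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    # Sort by memory; on ties, visit the most accurate point first.
    for memory, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((memory, accuracy))
            best_accuracy = accuracy
    return frontier


def memory_per_generation(weight_gb: float, kv_gb: float, batch_size: int) -> float:
    """Weights shared by `batch_size` generations, plus one generation's KV cache."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return weight_gb / batch_size + kv_gb


# Placeholder (memory, accuracy) pairs, not results from the paper.
toy_points = [(1.0, 0.2), (2.0, 0.1), (3.0, 0.5), (3.0, 0.4)]
print(pareto_frontier(toy_points))
print(memory_per_generation(weight_gb=4.0, kv_gb=2.0, batch_size=4))
```

As the batch size grows, the weight term shrinks while the KV cache term does not, so under this accounting the cache takes a larger share of per-generation memory at larger batch sizes.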