Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang

TL;DR

This work tackles compute-optimal test-time scaling by generalizing TTS to dynamic, task-specific multi-LLM collaboration graphs under a fixed budget. It formalizes the problem as probabilistic graph optimization over edges, roles, and model assignments, and introduces Agent-REINFORCE, an LLM-agent-augmented REINFORCE framework that uses textual feedback as gradients to efficiently navigate the large search space. Three empirical insights guide initialization and updates, enabling task-aware exploration of width-depth trade-offs and model-family replication. Experiments on MATH, MMLU, and HumanEval show that Agent-REINFORCE outperforms traditional and LLM-based baselines in search efficiency and final performance, including when optimizing for joint accuracy and latency, demonstrating practical benefits for adaptive, compute-aware inference.

Abstract

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
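The sampling-gradient-update loop that the abstract refers to can be illustrated with plain REINFORCE over a probabilistic graph. The sketch below is not Agent-REINFORCE itself: it uses a numeric score-function gradient where the paper substitutes textual feedback from an LLM agent, and it models only edges, leaving out roles and model assignments. The function names, the Bernoulli-per-edge parameterization, and the toy reward (agreement with a hypothetical target chain graph) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_graph(logits, rng):
    """Sample a DAG: edge (i, j), i < j, is kept with probability sigmoid(logit)."""
    return {edge: rng.random() < sigmoid(theta) for edge, theta in logits.items()}


def toy_reward(graph, target):
    """Hypothetical stand-in for task utility: fraction of edges matching a target."""
    return sum(graph[e] == target[e] for e in graph) / len(graph)


def reinforce(num_nodes=4, steps=400, batch=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Only forward edges (i < j) are allowed, so every sample is acyclic.
    edges = [(i, j) for i in range(num_nodes) for j in range(i + 1, num_nodes)]
    logits = {e: 0.0 for e in edges}  # start at probability 0.5 per edge
    # Assumed target for the toy reward: a sequential chain 0 -> 1 -> 2 -> 3.
    target = {e: (e[1] == e[0] + 1) for e in edges}

    for _ in range(steps):
        # 1) Sampling: draw a batch of candidate graphs.
        samples = [sample_graph(logits, rng) for _ in range(batch)]
        rewards = [toy_reward(g, target) for g in samples]
        baseline = sum(rewards) / batch  # batch-mean baseline to reduce variance

        # 2) Gradient: for a Bernoulli edge x with p = sigmoid(theta),
        #    d/dtheta log P(x) = x - p.
        # 3) Update: gradient ascent on the expected reward.
        for e in edges:
            p = sigmoid(logits[e])
            grad = sum(
                (r - baseline) * ((1.0 if g[e] else 0.0) - p)
                for g, r in zip(samples, rewards)
            ) / batch
            logits[e] += lr * grad

    return {e: sigmoid(t) for e, t in logits.items()}, target


if __name__ == "__main__":
    probs, _ = reinforce()
    for edge, p in sorted(probs.items()):
        print(edge, round(p, 3))
```

In this toy setting the edge probabilities drift toward the target chain. In the framework described above, the gradient step would instead be replaced by agent-generated feedback that plays the role of a textual gradient.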

Paper Structure

This paper contains 43 sections, 1 theorem, 23 equations, 9 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

For each node $v_i$, the cost depends on the size of the model and its effective input/output lengths, which makes it depend on the node in-degree $d(v_i)$. Summing over all nodes, the total cost can be expressed as $f_{\text{cost}}(G,T) \;=\; \sum_{v_i \in \mathcal{V}} \bigl[\, \alpha_i\, d(v_i)^

Figures (9)

  • Figure 1: Accuracy across different topologies and model combinations on MATH and MMLU. LLaMA-3 models are used by default. Detailed data is in the appendix on pilot experiments.
  • Figure 2: Generalize TTS as a graph.
  • Figure 3: Performance on MATH and MMLU across model family and size. LLaMA by default.
  • Figure 4: (a--b) Performance with Parallel and Sequential Scaling on various datasets. (c) Heatmap of performance under various Width-Depth collaboration graphs on MATH. Model is LLaMA-3 1B.
  • Figure 5: Overview of Agent-REINFORCE for Optimizing Collaboration Graph.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Theorem 1: FLOPs Cost Function
  • Definition 1: Test-time Compute-optimal Multi-LLM Collaboration Graph for a Specific Task
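
Read together with the cost function of Theorem 1, Definition 1 can be viewed as a budget-constrained search over collaboration graphs. The form below is a sketch under stated assumptions, not the paper's exact statement: $f_{\text{cost}}(G,T)$ is the cost function named in Theorem 1, while the task utility $u(G,T)$, the compute budget $B$, and the candidate set $\mathcal{G}$ are symbols introduced here for illustration.

$$
G^{*} \;=\; \arg\max_{G \in \mathcal{G}} \; u(G, T)
\quad \text{s.t.} \quad f_{\text{cost}}(G, T) \;\le\; B
$$

Under this reading, "compute-optimal" means the best-performing graph for task $T$ among those whose cost stays within the fixed budget.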