Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification

Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, Junxian He

TL;DR

This work investigates how to push test-time scaling limits for deep search agents by exploiting asymmetric verification, where verifying a candidate answer is easier than generating it. It analyzes sequential and parallel scaling, showing that while sequential scaling saturates quickly, verifier-based strategies (and parallel sampling) deliver superior accuracy-cost trade-offs. It demonstrates that open-source models can reach or rival commercial systems by turning them into Heavy variants through test-time scaling, achieving up to 69% on BrowseComp and strong GAIA/xbench performance. The study highlights practical compute-efficient design choices for agentic search and suggests directions for integrating verification more deeply into training and inference to guide search trajectories.

Abstract

Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as asymmetric verification, highlights the strong potential of test-time scaling (TTS). In this work, we study both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. In experiments, we first show that sequential scaling methods, such as budget forcing, can be effective initially but soon degrade performance. Leveraging asymmetric verification, however, we are able to achieve substantial improvements by allocating only a modest amount of compute to the verifier. We conduct experiments with flagship open-source models and extend them to their "Heavy" variants through TTS. These deep research agents achieve gains of up to 27 absolute points on benchmarks such as BrowseComp. Remarkably, as an open-source alternative, GLM-4.5 Heavy reaches an accuracy of 54.0% on BrowseComp and 66.0% on GAIA, making it comparable to the best proprietary choices such as OpenAI Deep Research. Tongyi-DeepResearch Heavy further achieves 69.0% accuracy on BrowseComp, greatly surpassing the best proprietary results.

Paper Structure

This paper contains 32 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Top part shows accuracy on BrowseComp and GAIA. Results marked with * are from our test runs; *-Heavy denotes the accuracy after our test-time scaling. Bottom left shows how accuracy on BrowseComp varies with tool calls. Solid lines indicate scaling the search agent’s compute, dashed lines indicate allocating compute to a verifier. Bottom right shows strategies for extending GLM-4.5 to GLM-4.5 Heavy on BrowseComp.
  • Figure 2: Sequential scaling of compute, with the x-axis representing the actual number of tool calls, and the y-axis representing Pass@1 accuracy on BrowseComp. For Max # Tool Call, search tool limits are set to 15, 30, and 50 for Qwen3-2507 and K2, and 15, 30, 50, and 100 for GLM-4.5. Budget Forcing begins from each model’s peak Max # Tool Call setting and expands until saturation: Qwen3-2507 starts at 15, adding 15 tools per step (7 expansions); K2 starts at 30, adding 30 per step (2 expansions); GLM-4.5 starts at 50, adding 50 per step (2 expansions).
  • Figure 3: Parallel scaling of compute, where the x-axis shows K and the y-axis shows Pass@K and Maj@K accuracy on BrowseComp. For K2 and GLM-4.5, we first apply the Max # Tool Call strategy, setting maximum tool usage to 30 and 50 respectively, then perform parallel scaling by independently sampling K = 1, 2, 4, 8, 16, 24, 32 trajectories. For Qwen3-2507, we apply the Max # Tool Call strategy with a maximum of 15 tools, then apply Budget Forcing to add 15 more, followed by parallel scaling with K = 1, 2, 4, 8, 16, 24, 32 trajectories.
  • Figure 4: Parallel scaling results of different models on BrowseComp. The x-axis shows the number of tool calls counting both the searcher and the verifier. The solid lines represent the growth of Maj@K corresponding to scaling the searcher's compute, while the dashed lines represent the growth of Best-of-K and Weighted Voting after introducing a verifier.
  • Figure 5: Different strategies for scaling verifier computation across models. The top panel shows accuracy, and the bottom panel shows the corresponding number of tool calls. Along the x-axis, vanilla indicates the Maj@8 accuracy achieved by scaling the search agent without verification, while Max # Tool Call, Budget Forcing, and Parallel Scaling show the Best-of-8 results when applying these strategies to increase verifier compute.
  • ...and 9 more figures
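
The captions for Figures 3 to 5 refer to four ways of scoring K sampled trajectories: Pass@K, Maj@K, Best-of-K, and Weighted Voting. The sketch below illustrates one common reading of these terms and is not the authors' implementation. It assumes each trajectory yields a final answer already normalised to a string and that the verifier returns one numeric score per trajectory (higher is better). The function names, the score scale, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def pass_at_k(answers, reference):
    """Pass@K: True if any of the K sampled answers matches the reference."""
    return reference in answers


def majority_vote(answers):
    """Maj@K: the most frequent answer among the K samples (no verifier)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_k(answers, scores):
    """Best-of-K: the single answer with the highest verifier score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_vote(answers, scores):
    """Weighted Voting: sum verifier scores per distinct answer, pick the largest."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy data (assumed, not from the paper): six trajectories, three distinct answers.
answers = ["A", "B", "B", "A", "B", "C"]
scores = [0.9, 0.2, 0.3, 0.8, 0.1, 0.4]  # hypothetical verifier scores

maj = majority_vote(answers)           # "B": it appears three times
best = best_of_k(answers, scores)      # "A": the top score is 0.9
weighted = weighted_vote(answers, scores)  # "A": 1.7 versus 0.6 for "B"
hit = pass_at_k(answers, "C")          # True: "C" is among the samples
```

In this toy case the majority answer and the verifier-backed answer disagree, which is the situation where spending compute on a verifier can change the outcome. Whether it does so on real benchmarks depends on verifier quality, which the sketch does not model.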