BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents

Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Pengjun Xie, Jingren Zhou, Yong Jiang

TL;DR

The paper tackles confidence estimation in multi-turn web-agent workflows, where verbalized confidence correlates with task success but calibration is unreliable. It introduces BrowseConf, a test-time scaling framework that uses a calibrated confidence threshold to trigger additional attempts, thereby allocating computation where it most improves answers. By comparing three variants (Zero, Summary-guided, Negative-constrained) against fixed-budget baselines on BrowseComp and its Chinese counterpart, the study shows BrowseConf can match or exceed baseline accuracy while markedly reducing the average number of attempts. The work demonstrates the practical benefit of confidence-aware compute allocation for complex information-seeking agents and points to future improvements via richer cross-attempt feedback.
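The threshold-triggered retry loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_agent`, `threshold`, and `max_attempts` are hypothetical names, confidence is assumed to be a verbalized score on a 0 to 100 scale, and falling back to the highest-confidence answer when no attempt clears the threshold is an assumption made for this sketch.

```python
from typing import Callable, Tuple

# Hypothetical attempt result: (answer, verbalized confidence on a 0-100 scale).
Attempt = Tuple[str, float]


def confidence_guided_attempts(
    run_agent: Callable[[int], Attempt],
    threshold: float,
    max_attempts: int,
) -> Tuple[str, float, int]:
    """Re-run the agent until its confidence reaches `threshold`.

    `run_agent` receives the 1-based attempt index and returns an Attempt.
    Returns (answer, confidence, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    best_answer, best_conf = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        answer, conf = run_agent(attempt)
        # Track the most confident answer seen so far (fallback; an assumption).
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        # Stop early once the agent is confident enough: no further compute spent.
        if conf >= threshold:
            return answer, conf, attempt
    # Budget exhausted without reaching the threshold.
    return best_answer, best_conf, max_attempts
```

Easy questions that clear the threshold on the first attempt cost a single run, while only low-confidence questions consume the remaining budget, which is how the average number of attempts can drop relative to a fixed-budget baseline.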

Abstract

Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents can communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task than outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality and encourage the model to try again until it reaches a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed-budget TTS methods.
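The observation that accuracy is high at high confidence and near zero at low confidence comes from grouping items by their verbalized confidence. A minimal sketch of that kind of analysis is shown below. The function name, the `(confidence, correct)` record format, and the 0 to 100 confidence scale are assumptions for illustration; the 5-point bin width follows the Figure 1 caption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_confidence_bin(
    records: Iterable[Tuple[float, bool]],
    bin_width: int = 5,
) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Group (confidence, correct) records into fixed-width confidence bins.

    Returns {(bin_start, bin_end): (accuracy, proportion_of_items)}.
    Bins that contain no items are omitted.
    """
    if bin_width <= 0 or 100 % bin_width != 0:
        raise ValueError("bin_width must be a positive divisor of 100")

    counts = defaultdict(int)
    hits = defaultdict(int)
    total = 0
    for conf, correct in records:
        if not 0 <= conf <= 100:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 100 is folded into the top bin.
        start = min(int(conf // bin_width) * bin_width, 100 - bin_width)
        counts[start] += 1
        hits[start] += int(correct)
        total += 1

    return {
        (start, start + bin_width): (hits[start] / counts[start], counts[start] / total)
        for start in sorted(counts)
    }
```

Comparing each bin's accuracy with the overall accuracy gives a quick view of how informative the verbalized score is, and where a retry threshold might reasonably be placed.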

Paper Structure

This paper contains 25 sections, 1 equation, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Bar charts showing accuracy against verbalized confidence score intervals for gpt-oss-120b (left) and DeepSeek-V3.1 (right). The x-axis represents the model's confidence, grouped into 5-point intervals. The y-axis indicates task accuracy. Green lines plot the accuracy of items within each confidence interval. Grey bars show the proportion of items in each interval. Dashed purple horizontal lines show the overall accuracy for each model. Intervals containing no items are omitted from the plots.
  • Figure 2: Average change in number of interactions between consecutive attempts, tested on BrowseComp using DeepSeek-V3.1.
  • Figure :