Agentic Test-Time Scaling for WebAgents

Nicholas Lee; Lutfi Eren Erdogan; Chris Joseph John; Surya Krishnapillai; Michael W. Mahoney; Kurt Keutzer; Amir Gholami

TL;DR

This paper investigates how to allocate inference-time compute for multi-step web agents. The authors show that uniformly scaling per-step sampling yields diminishing returns on long-horizon tasks and propose CATTS (Confidence-Aware Test-Time Scaling), a dynamic policy that uses vote-derived uncertainty (entropy and top-1/top-2 margin) to invoke an LLM arbiter only on contentious steps. On WebArena-Lite and GoBrowse, CATTS improves performance by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, and it outperforms static voting, simple arbitration, and several deeper aggregation baselines. The work identifies two operating regimes, redundancy (high consensus) and contention (genuine uncertainty), and demonstrates that uncertainty-guided compute allocation provides practical, interpretable improvements for reliable agentic behavior in browser-based tasks.

Abstract

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well understood: small per-step errors can compound over long horizons, and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting but can also overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
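The gating idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that sampled actions are already clustered into comparable strings, that entropy is measured in nats without normalization, and that a low margin counts as high uncertainty. The names `vote_distribution`, `catts_step`, and the `arbiter` callback are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple


def vote_distribution(actions: Sequence[str]) -> dict[str, float]:
    """Empirical vote distribution over sampled (already clustered) actions."""
    counts = Counter(actions)
    total = len(actions)
    return {action: count / total for action, count in counts.items()}


def entropy(p: dict[str, float]) -> float:
    """Shannon entropy in nats (assumption: the paper may normalize differently)."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)


def margin(p: dict[str, float]) -> float:
    """Top-1 minus top-2 probability; equals top-1 if only one action was proposed."""
    top = sorted(p.values(), reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def catts_step(
    actions: Sequence[str],
    arbiter: Callable[[Sequence[str]], str],  # hypothetical stand-in for an LLM call
    tau: float,
    use_entropy: bool = True,
) -> Tuple[str, bool]:
    """Return (chosen_action, arbiter_was_called) for one agent step."""
    p = vote_distribution(actions)
    majority_action = max(p, key=p.get)
    # Assumption: high entropy or low margin marks a contentious step.
    contentious = entropy(p) > tau if use_entropy else margin(p) < tau
    if contentious:
        return arbiter(list(p.keys())), True
    return majority_action, False


# Illustrative inputs only; the action strings are made up.
consensus = ["click(3)"] * 9 + ["type(5)"]
split = ["a"] * 4 + ["b"] * 3 + ["c"] * 3
pick_b = lambda candidates: "b"
print(catts_step(consensus, pick_b, tau=0.5))
print(catts_step(split, pick_b, tau=0.5))
```

With these inputs, the high-consensus step stays on the majority-vote path and skips the extra call, while the split step is routed to the arbiter. This mirrors the redundancy and contention regimes named in the TL;DR.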

Paper Structure (46 sections, 9 equations, 7 figures, 10 tables)

This paper contains 46 sections, 9 equations, 7 figures, 10 tables.

Introduction
Related Work
Inference-Time Scaling and Test-Time Compute.
Tool-Using Agents and Long-Horizon Tasks.
From Static to Dynamic Inference-Time Scaling
Experimental Setup
Action Clustering and Vote Distributions.
Static Baseline: Majority Voting
Observation: Majority Vote Yields Diminishing Returns.
Takeaway.
From Voting to Arbitration
Observation: Arbitration Improves Over Majority Vote.
Arbiter Scaling.
Observation: Arbitration is not Uniformly Beneficial.
Deeper Aggregation Methods.
...and 31 more sections

Figures (7)

Figure 1: Comparing agentic inference-time scaling methods. Visual comparison of selection strategies at each agent step. Left: Majority Voting samples $N$ candidates and selects the most frequent action via argmax over vote distribution $p_t(a)$. Center-Left: Arbiter samples $N$ candidates and uses an additional LLM call to reason over candidates and select the best action. Center-Right: CATTS conditionally invokes the arbiter only when vote-derived uncertainty (entropy $H_t$ or margin $\Delta_t$) exceeds threshold $\tau$, and otherwise falls back to majority voting.
Figure 2: Uncertainty profiles over trajectory steps. Entropy $H_t$ (top) and probability margin $\Delta_t$ (bottom) versus step index, separated by successful (blue) and failed (orange) runs, averaged across all experiments on both WebArena-Lite and GoBrowse. Failed tasks consistently exhibit higher entropy and lower margins throughout, with the gap widening at later steps. Successful tasks maintain high margins (${\approx}0.7$) and low entropy (${\approx}0.3$) early on, indicating clearer consensus. This demonstrates that vote-derived uncertainty is correlated with task success and can guide dynamic compute allocation.
Figure 3: High-consensus override analysis. Task success rate decreases as the number of high-consensus overrides ($\Delta_t > 0.7$) increases, showing a dose-response pattern. The red dashed line indicates the overall success rate (44.0%). The effect is significant ($p = 0.026$, Fisher's exact test) and consistent across all websites.
Figure 4: Arbiter effectiveness varies with uncertainty. Tasks are grouped by average trajectory entropy and evaluated under both arbiter and majority voting, aggregated across all runs. Net advantage measures the difference in win rates: positive values indicate that arbitration succeeds on more tasks where majority voting fails than vice versa. At low entropy, arbitration provides no benefit and can hurt performance ($-4.4\%$). At higher entropy levels, arbitration consistently outperforms majority voting ($+4\%$ to $+6\%$), demonstrating that its reasoning capabilities are most valuable when the candidate distribution lacks a clear signal.
Figure 5: Accuracy--compute frontier across all methods. Success rate versus total tokens per episode on WebArena-Lite (left) and GoBrowse (right). Each point represents a different configuration: Majority Vote varies $N \in \{1,3,5,10,20\}$; Arbiter shows $K{=}1$ (one Arbiter used) with varying $N$; Arbiter Scaling shows increasing $K$ at fixed $N{=}5$; CATTS (Entropy/Margin) sweeps thresholds $\tau$ at $N{=}10$; DeepConf varies $N \in \{3,5,10,20\}$. CATTS achieves Pareto improvements: on WebArena-Lite, it reaches 47.9% success at ${\sim}750$K tokens (vs. 43.2% for Majority Vote at 920K tokens). DeepConf also performs strongly, achieving competitive accuracy at lower token budgets than majority vote.
...and 2 more figures
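
The "net advantage" statistic in the Figure 4 caption can be read as a paired comparison over tasks. The sketch below is one plausible reading and does not reproduce the paper's exact computation. It assumes per-task boolean outcomes for both methods and normalizes by the number of tasks in the group. The function name `net_advantage` and the example outcomes are hypothetical.

```python
from typing import Sequence


def net_advantage(
    arbiter_success: Sequence[bool],
    majority_success: Sequence[bool],
) -> float:
    """Fraction of tasks won only by the arbiter minus fraction won only by majority vote."""
    if len(arbiter_success) != len(majority_success):
        raise ValueError("outcome lists must cover the same tasks")
    n = len(arbiter_success)
    if n == 0:
        raise ValueError("need at least one task")
    pairs = list(zip(arbiter_success, majority_success))
    # Tasks where exactly one of the two methods succeeded.
    arbiter_only = sum(1 for a, m in pairs if a and not m)
    majority_only = sum(1 for a, m in pairs if m and not a)
    return (arbiter_only - majority_only) / n


# Illustrative outcomes for four tasks in one entropy bucket (not data from the paper).
print(net_advantage([True, True, False, True], [True, False, False, False]))
```

A positive value means arbitration rescued more tasks than it lost relative to majority voting in that bucket, and a negative value means the opposite.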
