PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents

Shifat E. Arman; Syed Nazmus Sakib; Tapodhir Karmakar Taton; Nafiul Haque; Shahrear Bin Amin

TL;DR

PATHWAYS is a 250-task benchmark that probes investigative competence in AI web agents: whether they uncover hidden context instead of acting on surface signals alone. Across models, it shows a persistent Navigation–Discovery Gap (agents reach relevant pages but rarely retrieve the decisive hidden evidence) and Investigative Hallucination (agents claim to rely on evidence they never accessed), with performance collapsing when surface cues mislead and hidden evidence must be sought and integrated. Using two domains (Shopping Admin and Reddit Moderation) and a dedicated metric suite, the study reveals last-mile failures, where discovered evidence and reasoning still do not translate into correct, policy-compliant decisions, and shows that more explicit prompting improves context discovery but often lowers decision accuracy. The work argues for architectural advances in epistemic curiosity and evidence-grade assessment to build safer, more accountable, and more reliable web agents in information-asymmetric environments.

Abstract

We introduce PATHWAYS, a benchmark of 250 multi-step decision tasks that test whether web-based agents can discover and correctly use hidden contextual information. Across both closed and open models, agents typically navigate to relevant pages but retrieve decisive hidden evidence in only a small fraction of cases. When tasks require overturning misleading surface-level signals, performance drops sharply to near chance accuracy. Agents frequently hallucinate investigative reasoning by claiming to rely on evidence they never accessed. Even when correct context is discovered, agents often fail to integrate it into their final decision. Providing more explicit instructions improves context discovery but often reduces overall accuracy, revealing a tradeoff between procedural compliance and effective judgement. Together, these results show that current web agent architectures lack reliable mechanisms for adaptive investigation, evidence integration, and judgement override.
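
The quantities the abstract contrasts (reaching a relevant page, retrieving the decisive evidence, and citing evidence that was never accessed) can be illustrated with a minimal Python sketch. This is not the paper's metric suite: the `Trajectory` log format, the `summarize` helper, and the rate names below are all hypothetical, chosen only to show how such a gap could be measured from agent logs.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One agent run on one task (hypothetical log format, not from the paper)."""
    visited: set[str]    # pages the agent actually opened
    cited: set[str]      # pages the agent's rationale claims to rely on
    relevant_page: str   # surface page the task starts from
    evidence_page: str   # page holding the decisive hidden evidence


def summarize(runs: list[Trajectory]) -> dict[str, float]:
    """Compute illustrative navigation, discovery, and hallucination rates."""
    if not runs:
        raise ValueError("need at least one trajectory")
    n = len(runs)
    navigated = sum(r.relevant_page in r.visited for r in runs)
    discovered = sum(r.evidence_page in r.visited for r in runs)
    # A run counts as hallucinated if it cites any page it never visited.
    hallucinated = sum(bool(r.cited - r.visited) for r in runs)
    return {
        "navigation_rate": navigated / n,
        "discovery_rate": discovered / n,
        "navigation_discovery_gap": (navigated - discovered) / n,
        "hallucination_rate": hallucinated / n,
    }


# Toy data: four made-up runs on a refund-audit style task.
runs = [
    Trajectory({"order", "shipping_log"}, {"shipping_log"}, "order", "shipping_log"),
    Trajectory({"order"}, {"shipping_log"}, "order", "shipping_log"),  # cites unseen page
    Trajectory({"order"}, {"order"}, "order", "shipping_log"),
    Trajectory({"home"}, set(), "order", "shipping_log"),
]
stats = summarize(runs)
```

In this toy data, three of four runs reach the order page but only one opens the shipping log, so the gap is the difference between those two rates; one run cites the shipping log without ever visiting it and is counted as hallucinated.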

Paper Structure (49 sections, 8 equations, 14 figures, 19 tables)

Introduction
Related Work
Functional Competence and Safety Benchmarks
Reasoning, Hallucination, and Epistemic Limitations
PATHWAYS
Human Curation and Task Structure
Shopping Admin Task Design
Reddit Moderation Task Design
Evaluation
Component Metrics
Process Integrity
Behavioral Analysis
Experimentation
How well do models conduct investigation?
How well do models reason regarding investigation?
...and 34 more sections

Figures (14)

Figure 1: The landscape of Autonomous Agent evaluation. Existing benchmarks predominantly focus on either Functional Competence (executing explicit instructions) or Adversarial Safety (refusing harmful prompts). PATHWAYS introduces the third critical dimension: Investigative Competence. By requiring agents to actively seek hidden context, PATHWAYS bridges the gap between blind execution and rigid refusal, aiming towards the Ideal Robust Agent that is autonomous, safe, and context-aware.
Figure 2: Overview of investigative trajectories in PATHWAYS. Left: A Reddit moderation task requiring cross-referencing an external wiki to verify a misleading claim. Right: A Shopping Admin task where a refund request is audited by correlating surface order details with hidden shipping logs and internal security notes.
Figure 3: Task Completion Rate ($R_{\mathrm{comp}}$) across models and domains. While models achieve near-perfect completion on Reddit (e.g., Gemini at 98.9%), they struggle significantly more with the complex navigation of Shopping Admin.
Figure 4: Task-category level Task Completion Rate ($R_{\mathrm{comp}}$) across models in Shopping Admin (left) and Reddit Moderation (right), illustrating differences in agents’ ability to reach a terminal decision across task types.
Figure 5: Qualitative comparison of Agent reasoning. The Raw prompt leads to aggressive, generalized conclusions ("Ban User"), whereas the Engineered prompt with explicit hints guides the agent to specific contradictory evidence, resulting in a more precise and proportionate decision ("Apply Misinformation Flair").
...and 9 more figures
