When Is Enough Not Enough? Illusory Completion in Search Agents

Dayoon Ko, Jihyuk Kim, Sohyeon Kim, Haeju Park, Dahyun Lee, Gunhee Kim, Moontae Lee, Kyungjae Lee

TL;DR

The paper investigates illusory completion in search agents solving multi-constraint questions, revealing that final-answer correctness often disguises unverified or violated constraints. It introduces the Epistemic Ledger to jointly track evidential support and agent beliefs for each constraint, and identifies four failure modes: bare assertion, overlooked refutation, stagnation, and premature exit. The study shows that explicit constraint-state tracking with LiveLedger reduces underverification and boosts accuracy across diverse agents, while also enhancing exploration efficiency. These findings underscore the importance of process-aware evaluation in multi-constraint QA and offer a practical mechanism to improve reliability in long-horizon search agents.

Abstract

Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents' beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
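
To make the idea of per-constraint tracking concrete, the following is a minimal sketch, not the authors' implementation of the Epistemic Ledger or LiveLedger. It assumes a ledger can be modeled as a map from constraint IDs to two independent fields: the evidence status gathered from search and the agent's own belief. All names here (`Evidence`, `ConstraintState`, `Ledger`, `unverified`, `is_illusory_completion`) and the example constraints are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Evidence(Enum):
    """Hypothetical evidence states for a single constraint."""
    UNKNOWN = "unknown"      # nothing retrieved so far bears on the constraint
    SUPPORTED = "supported"  # retrieved evidence backs the constraint
    REFUTED = "refuted"      # retrieved evidence contradicts the constraint


@dataclass
class ConstraintState:
    text: str
    evidence: Evidence = Evidence.UNKNOWN
    believed_satisfied: bool = False  # the agent's claim, tracked separately from evidence


@dataclass
class Ledger:
    constraints: dict[str, ConstraintState] = field(default_factory=dict)

    def add(self, cid: str, text: str) -> None:
        self.constraints[cid] = ConstraintState(text)

    def record_evidence(self, cid: str, evidence: Evidence) -> None:
        self.constraints[cid].evidence = evidence

    def record_belief(self, cid: str, believed: bool) -> None:
        self.constraints[cid].believed_satisfied = believed

    def unverified(self) -> list[str]:
        """IDs the agent believes satisfied without supporting evidence."""
        return [
            cid
            for cid, state in self.constraints.items()
            if state.believed_satisfied and state.evidence is not Evidence.SUPPORTED
        ]

    def is_illusory_completion(self) -> bool:
        """True when the agent believes every constraint holds but some lack support."""
        states = list(self.constraints.values())
        if not states:
            return False
        all_believed = all(s.believed_satisfied for s in states)
        return all_believed and bool(self.unverified())


# Usage: two constraints, both believed satisfied, only one backed by evidence.
ledger = Ledger()
ledger.add("c1", "published before 2000")
ledger.add("c2", "written by a Nobel laureate")
ledger.record_belief("c1", True)
ledger.record_evidence("c1", Evidence.SUPPORTED)
ledger.record_belief("c2", True)  # belief asserted, no evidence recorded
print(ledger.unverified(), ledger.is_illusory_completion())
```

Keeping belief and evidence as separate fields is the design choice that matters in this sketch: a mismatch between them is what lets an answer be flagged as underverified even when the final answer happens to be correct.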

Paper Structure

This paper contains 53 sections, 10 equations, 12 figures, 3 tables, and 2 algorithms.

Figures (12)

  • Figure 1: Illusory completion in search agents. The figure shows incorrect and correct answers on multi-constraint problems, highlighting underverified cases with unsupported constraints. Such cases reflect illusory completion, where agents falsely believe a task is complete. The hatched regions show the severity of illusory completion in existing models. Verified blue denotes fully supported correct answers, underverified blue denotes correct answers with unsupported constraints, and underverified red denotes incorrect answers that the agent falsely believes are complete.
  • Figure 2: Overview of our work. See §\ref{sec:search_agent}--\ref{sec:illusory_completion} for detailed explanations.
  • Figure 3: Breakdown of incorrect and correct answers by verification status. Results with LiveLedger are shown in bold.
  • Figure 4: Distribution of four illusory completion failure mechanisms under RL training (top) and model-size scaling (bottom).
  • Figure 5: Effect of LiveLedger on failure mechanisms.
  • ...and 7 more figures