When Is Enough Not Enough? Illusory Completion in Search Agents
Dayoon Ko, Jihyuk Kim, Sohyeon Kim, Haeju Park, Dahyun Lee, Gunhee Kim, Moontae Lee, Kyungjae Lee
TL;DR
The paper investigates illusory completion in search agents solving multi-constraint questions: agents treat a task as complete while some constraints remain unverified or violated, and a correct final answer often masks this. It introduces the Epistemic Ledger, an evaluation framework that jointly tracks evidential support and agent beliefs for each constraint, and identifies four failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Explicit constraint-state tracking with LiveLedger, an inference-time tracker, reduces underverified answers and improves accuracy across diverse agents, while also improving exploration efficiency. These findings underscore the importance of process-aware evaluation in multi-constraint QA and offer a practical mechanism for improving reliability in long-horizon search agents.
Abstract
Recent search agents leverage multi-turn reasoning and search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably reason across all requirements by tracking, verifying, and maintaining multiple conditions in these questions. We study this capability under multi-constraint problems, where valid answers must satisfy several constraints simultaneously. We find that illusory completion frequently occurs, wherein agents believe tasks are complete despite unresolved or violated constraints, leading to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks evidential support and agents' beliefs for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we examine whether explicit constraint-state tracking during execution mitigates these failures via LiveLedger, an inference-time tracker. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
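To make the idea of per-constraint tracking concrete, the following is a minimal sketch of a ledger that records, for each constraint, both the evidence status and the agent's belief. It is an illustration under stated assumptions, not the paper's Epistemic Ledger or LiveLedger: the class names (`Evidence`, `Entry`, `Ledger`), the three evidence states, the labels "unresolved" and "violated", and the example constraint names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Evidence(Enum):
    # Hypothetical evidence states; the paper may use a different scheme.
    NONE = "none"            # no retrieved evidence bears on the constraint
    SUPPORTED = "supported"  # retrieved evidence supports the constraint
    REFUTED = "refuted"      # retrieved evidence contradicts the constraint


@dataclass
class Entry:
    evidence: Evidence = Evidence.NONE
    believed_satisfied: bool = False  # what the agent currently claims


@dataclass
class Ledger:
    entries: dict[str, Entry] = field(default_factory=dict)

    def add_constraint(self, name: str) -> None:
        self.entries[name] = Entry()

    def record(self, name: str, evidence: Optional[Evidence] = None,
               believed: Optional[bool] = None) -> None:
        # Update only the fields that are passed in.
        entry = self.entries[name]
        if evidence is not None:
            entry.evidence = evidence
        if believed is not None:
            entry.believed_satisfied = believed

    def underverified(self) -> dict[str, str]:
        # Constraints the agent believes are satisfied without supporting evidence.
        flagged = {}
        for name, entry in self.entries.items():
            if not entry.believed_satisfied:
                continue
            if entry.evidence is Evidence.NONE:
                flagged[name] = "unresolved"
            elif entry.evidence is Evidence.REFUTED:
                flagged[name] = "violated"
        return flagged

    def illusory_completion(self) -> bool:
        # The agent believes every constraint holds, yet some lack support.
        believes_done = all(e.believed_satisfied for e in self.entries.values())
        return believes_done and bool(self.underverified())


# Example with made-up constraints for a three-constraint question.
ledger = Ledger()
for constraint in ("born_in_1970s", "won_award", "directed_film"):
    ledger.add_constraint(constraint)

ledger.record("born_in_1970s", Evidence.SUPPORTED, True)
ledger.record("won_award", Evidence.REFUTED, True)  # belief contradicts evidence
ledger.record("directed_film", believed=True)       # belief with no evidence

print(ledger.underverified())
print(ledger.illusory_completion())
```

Keeping evidence and belief as separate fields is the point of the sketch: a mismatch between the two is only detectable when both are recorded for every constraint, rather than inferred from the final answer.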
