Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira
TL;DR
This work identifies a pervasive agreement bias in multimodal LLM verifiers, wherein models over-validate flawed agent behavior across open-ended tasks. It proposes Self-Grounded Verification (SGV), a two-step method that first elicits broad priors from an MLLM and then evaluates trajectories conditioned on those priors, leading to more human-aligned judgments. SGV yields substantial improvements in failure detection and accuracy (up to 25 pp and 14 pp, respectively) and enables better downstream performance, including new state-of-the-art results on VisualWebArena and gains in online supervision and self-refinement. The study emphasizes careful evaluation using fine-grained metrics and introduces an enhanced VisualWebArena with higher fidelity, parallelism, and speed, while outlining limitations and future directions for combining MLLMs with other grounding approaches. The work has practical impact for building more reliable, scalable verifiers in multimodal and interactive AI systems.
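To illustrate why fine-grained metrics matter here, the sketch below compares accuracy with per-class rates for a verifier that approves everything. It is a minimal illustration, not the paper's evaluation code: the function name `verifier_metrics`, the metric names, and the toy labels are all hypothetical, and "failure detection" is assumed to mean the share of human-labeled failures that the verifier rejects.

```python
from typing import Dict, Optional, Sequence


def verifier_metrics(
    labels: Sequence[bool], verdicts: Sequence[bool]
) -> Dict[str, Optional[float]]:
    """Compare verifier verdicts against human labels (True = task succeeded).

    Hypothetical helper for illustration; metric names are assumptions.
    """
    if len(labels) != len(verdicts) or not labels:
        raise ValueError("labels and verdicts must be non-empty and equally long")

    pairs = list(zip(labels, verdicts))
    tp = sum(1 for l, v in pairs if l and v)          # success, approved
    tn = sum(1 for l, v in pairs if not l and not v)  # failure, rejected
    fp = sum(1 for l, v in pairs if not l and v)      # failure, approved
    fn = sum(1 for l, v in pairs if l and not v)      # success, rejected

    def ratio(num: int, den: int) -> Optional[float]:
        # None when the class is absent, so the rate is undefined.
        return num / den if den else None

    return {
        "accuracy": (tp + tn) / len(pairs),
        "success_recall": ratio(tp, tp + fn),
        # Assumed reading of "failure detection": true negative rate.
        "failure_detection": ratio(tn, tn + fp),
    }


# Toy data: 6 successful and 4 failed trajectories, all approved by the verifier.
labels = [True] * 6 + [False] * 4
always_approve = [True] * 10
print(verifier_metrics(labels, always_approve))
```

On this toy data the always-approving verifier reaches 0.6 accuracy while detecting none of the failures, which is the kind of gap that a single aggregate score can hide.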
Abstract
Verifiers, functions that assign rewards to agent behavior, have been key for AI progress in domains like math and code. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers across web navigation, computer use, and robotic manipulation, and identify a critical limitation: a strong tendency to over-validate agent behavior, a phenomenon we term agreement bias. This bias is pervasive across models and resilient to test-time scaling, and it poses risks to existing methods that rely on MLLM evaluations. We discuss methods to evaluate and improve MLLM verifiers and introduce Self-Grounded Verification (SGV), a lightweight method that harnesses MLLMs' own sampling mechanisms by modulating (un)conditional generation to better leverage their knowledge, alignment, and reasoning. SGV operates in two steps: first, the MLLM is prompted to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on these self-generated priors, it reasons over and evaluates a candidate trajectory. SGV yields more human-aligned evaluations, with gains of up to 25 pp in failure detection and 14 pp in accuracy, and its benefits extend to downstream applications. In self-refinement and online supervision, SGV boosts task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena, where it sets a new state of the art, surpassing the previous best by 20 pp. We release an updated version of VisualWebArena featuring more human-aligned evaluators, high-fidelity environment parallelism, and speedups of over 10x.
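The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal, text-only illustration under stated assumptions, not the authors' implementation: the `Generate` callable stands in for any MLLM call, the prompt wording and the `VERDICT:` output convention are invented for the example, and the trajectory is reduced to a list of step descriptions, whereas the actual setting also involves visual observations.

```python
from typing import Callable, Sequence

# Hypothetical model interface: takes a prompt string, returns generated text.
Generate = Callable[[str], str]


def elicit_priors(generate: Generate, task: str) -> str:
    """Step 1: ask for broad expectations about the task.

    The trajectory is deliberately NOT shown, so the priors are
    independent of the data under evaluation.
    """
    prompt = (
        f"Task: {task}\n"
        "Describe in general terms what successful completion of this task "
        "looks like. Do not assume any specific attempt."
    )
    return generate(prompt)


def verify_with_priors(
    generate: Generate, task: str, priors: str, trajectory: Sequence[str]
) -> bool:
    """Step 2: evaluate the trajectory conditioned on the self-generated priors."""
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(trajectory))
    prompt = (
        f"Task: {task}\n"
        f"Expected behavior:\n{priors}\n"
        f"Agent trajectory:\n{steps}\n"
        "Compare the trajectory against the expected behavior, then end with "
        "a final line 'VERDICT: SUCCESS' or 'VERDICT: FAILURE'."
    )
    lines = generate(prompt).strip().splitlines()
    # Anything other than an explicit success verdict counts as failure.
    return bool(lines) and lines[-1].strip().upper() == "VERDICT: SUCCESS"


def sgv_verify(generate: Generate, task: str, trajectory: Sequence[str]) -> bool:
    """Run both steps with the same model."""
    priors = elicit_priors(generate, task)
    return verify_with_priors(generate, task, priors, trajectory)
```

Treating any reply without an explicit success verdict as a failure is a design choice made for this sketch; it leans against over-validation but is not something the abstract specifies.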
