AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

Min Wang, Ata Mahjoubfar

Abstract

Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with the Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.
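The interaction protocol described above (oracle privately holds a target; the model asks attribute questions, receives Yes/No/Unsure, and invalid actions are penalized with Skip) can be sketched as a minimal episode loop. This is an illustrative sketch, not the benchmark's actual harness: the function names (`run_episode`, `oracle_answer`, `ask_question`, `final_guess`) and the turn budget are assumptions introduced here.

```python
VALID_ANSWERS = {"Yes", "No", "Unsure"}

def run_episode(oracle_answer, ask_question, final_guess, max_turns=10):
    """Minimal sketch of an AMIGO-style episode (hypothetical interface).

    oracle_answer(question) -> "Yes" | "No" | "Unsure" (the oracle privately
        holds the target image and answers about its attributes).
    ask_question(history) -> next attribute-focused question, or None to stop
        asking and commit to a final guess.
    final_guess(history) -> the model's guessed image identifier.
    """
    history = []
    for _ in range(max_turns):
        question = ask_question(history)
        if question is None:  # model commits to a guess early
            break
        answer = oracle_answer(question)
        if answer not in VALID_ANSWERS:
            # Invalid oracle action is recorded as a penalized Skip turn.
            history.append((question, "Skip"))
        else:
            history.append((question, answer))
    return final_guess(history), history
```

In this sketch, the model's policy lives in `ask_question`/`final_guess`, so constraint tracking across turns reduces to conditioning on the accumulated `history`.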

Paper Structure

This paper contains 33 sections, 1 equation, and 16 figures.

Figures (16)

  • Figure 1: Two example multi-turn interactions for Guess My Preferred Dress. The VLM (blue) asks constrained Yes/No questions about fine-grained attributes; the user replies with Yes (green), No (red), or Unsure (yellow). Panel (a) shows an incorrect final guess and panel (b) shows a correct one.
  • Figure 2: Two example multi-turn interactions (continued).
  • Figure 3: The semi-automatic attribute labeling pipeline: attribute discovery and normalization, binary question template construction, and ensembled VLM-based labeling with quality control.
  • Figure 4: Image gallery generation pipeline: for a given target image, distractors are retrieved by attribute-based similarity and merged into a gallery with controlled difficulty via threshold $\tau$ and gallery size.
  • Figure 5: Four sample dress galleries from AMIGO. Each gallery contains one target image and visually similar distractors. The target is highlighted with a red outline.
  • ...and 11 more figures
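The gallery-generation idea in the Figure 4 caption (retrieve distractors by attribute-based similarity, with difficulty controlled by a threshold $\tau$ and the gallery size) can be illustrated with a short sketch. This is an assumption-laden illustration: the Jaccard similarity over attribute sets, the `build_gallery` signature, and the ranking-then-threshold rule are plausible choices introduced here, not the paper's documented pipeline.

```python
def attribute_similarity(a, b):
    """Jaccard similarity between two attribute sets (one plausible choice)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def build_gallery(target, pool, attributes, tau=0.5, gallery_size=4):
    """Sketch of attribute-based gallery construction (hypothetical API).

    target: target image id; pool: candidate distractor ids;
    attributes: dict mapping image id -> set of attribute strings;
    tau: minimum attribute similarity for an admissible distractor;
    gallery_size: total number of images in the gallery (target included).
    """
    scored = [
        (attribute_similarity(attributes[target], attributes[d]), d)
        for d in pool
        if d != target
    ]
    # Keep the most similar distractors whose similarity clears the threshold;
    # raising tau yields harder (more confusable) galleries.
    distractors = [d for s, d in sorted(scored, reverse=True) if s >= tau]
    return [target] + distractors[: gallery_size - 1]
```

Under this sketch, difficulty is tuned exactly as the caption suggests: a higher $\tau$ admits only near-duplicate distractors, and a larger `gallery_size` lengthens the search the model must perform.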