Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation

Edoardo Zorzi, Francesco Taioli, Yiming Wang, Marco Cristani, Alessandro Farinelli, Alberto Castellini, Loris Bazzani

Abstract

We propose Question-Asking Navigation (QAsk-Nav), the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that enables an explicit, separate assessment of embodied navigation and collaborative question asking. CoIN tasks an embodied agent with reaching a target specified in free-form natural language under partial observability, using only egocentric visual observations and interactive natural-language dialogue with a human, where the dialogue can help to resolve ambiguity among visually similar object instances. Existing CoIN benchmarks focus primarily on navigation success and offer no support for consistent evaluation of collaborative interaction. To address this limitation, QAsk-Nav provides (i) a lightweight question-asking protocol scored independently of navigation, (ii) an enhanced navigation protocol with realistic, diverse, high-quality target descriptions, and (iii) an open-source dataset that includes 28,000 quality-checked reasoning and question-asking traces for training and analysis of the interactive capabilities of CoIN models. Using the proposed QAsk-Nav benchmark, we develop Light-CoNav, a lightweight unified model for collaborative navigation that is 3x smaller and 70x faster than existing modular methods, while outperforming state-of-the-art CoIN approaches in generalization to unseen objects and environments. Project page at https://benchmarking-interaction.github.io/

Paper Structure

This paper contains 28 sections, 17 figures, and 7 tables.

Figures (17)

  • Figure 1: QAsk-Nav introduces two distinct protocols for question asking (top-left) and for navigation (top-right), supported by a novel dataset with structured reasoning, and question annotations (bottom). Our dataset enables reproducible evaluation, training, and analysis under both protocols to study the agent's capabilities of interaction reasoning and question-asking.
  • Figure 2: Examples from QAsk-Nav. Left: original image. Center: distractor with an altered sofa and painting colors. Right: distractor with an altered plant and painting.
  • Figure 3: Episode from QAsk-Nav. Column 1: the target image available only to the oracle and the navigation instructions provided to the agent. Columns 2-3: different observations, including image, question, and answer from the oracle. Before each question and conclusion, the model produces a reasoning trace motivating the decision. More detailed examples in the Supp. Mat.
  • Figure 4: Updated Task Annotations. Comparison between CoIN-Bench (taioli2025coin) annotations and our QAsk-Nav annotations. Old annotations often contain confusing, nonsensical, or hallucinated descriptions, while new annotations are clearer and more accurate.
  • Figure 5: Two samples $(D,O,R,S,Q,C)$ from the QAsk-Nav dataset. In the first example, the description $D$ matches the observation $O$, and thus the reasoning $R$ and score $S$ indicate a match (between $D$ and $O$). In the second example, the description is ambiguous and previous answers did not help: therefore $R$ and $S$ express uncertainty.
  • ...and 12 more figures
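To make the $(D,O,R,S,Q,C)$ sample structure from Figure 5 concrete, here is a minimal sketch of one such record as a Python dataclass. This is an illustration only, not the paper's actual data format: the class name `QAskSample`, the field names, the score range, and the example values are all assumptions; from the captions, $D$ is the target description, $O$ the observation, $R$ the reasoning trace, $S$ a match score, and $Q$/$C$ the question and conclusion the model may produce.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAskSample:
    """One (D, O, R, S, Q, C) sample; names/types are illustrative, not the paper's schema."""
    description: str           # D: free-form natural-language target description
    observation: str           # O: identifier of the egocentric image observation
    reasoning: str             # R: reasoning trace motivating the decision
    score: float               # S: how well D matches O (range assumed, e.g. [0, 1])
    question: Optional[str]    # Q: question asked when ambiguity remains, else None
    conclusion: Optional[str]  # C: final decision when reached, else None

# Hypothetical ambiguous case (cf. Figure 5, second example):
# the reasoning and score express uncertainty, so a question is asked
# instead of a conclusion being drawn.
sample = QAskSample(
    description="the sofa next to the painting",
    observation="obs_0132",
    reasoning="Two similar sofas are visible and previous answers did not disambiguate.",
    score=0.5,
    question="Is the target sofa the one closest to the painting?",
    conclusion=None,
)
print(sample.question is not None and sample.conclusion is None)
```

A record like this pairs each observation with both a navigation-relevant judgment ($R$, $S$) and an interaction act ($Q$ or $C$), which is what lets the question-asking protocol be scored independently of navigation.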