Robotic Task Ambiguity Resolution via Natural Language Interaction

Eugenio Chisari, Jan Ole von Hartz, Fabien Despinoy, Abhinav Valada

TL;DR

This work tackles task ambiguity in language-conditioned robotic policies by grounding natural language task descriptions in the observed scene and explicitly reasoning about ambiguity. It introduces AmbResVLM, which grounds task objects, detects ambiguity, generates clarifying queries, and re-grounds user responses within a structured JSON framework to guide downstream policy execution. The approach builds on Molmo with LoRA-based fine-tuning and is trained on both simulated and real-world datasets. In simulated and real-world experiments it outperforms recent state-of-the-art baselines at ambiguity detection and resolution, and on a real robot it raises the average success rate of downstream policies from 69.6% to 97.1%. The data, code, and trained models are publicly available.

Abstract

Language-conditioned policies have recently gained substantial adoption in robotics as they allow users to specify tasks using natural language, making them highly versatile. While much research has focused on improving the action prediction of language-conditioned policies, reasoning about task descriptions has been largely overlooked. Ambiguous task descriptions often lead to downstream policy failures due to misinterpretation by the robotic agent. To address this challenge, we introduce AmbResVLM, a novel method that grounds language goals in the observed scene and explicitly reasons about task ambiguity. We extensively evaluate its effectiveness in both simulated and real-world domains, demonstrating superior task ambiguity detection and resolution compared to recent state-of-the-art baselines. Finally, real robot experiments show that our model improves the performance of downstream robot policies, increasing the average success rate from 69.6% to 97.1%. We make the data, code, and trained models publicly available at https://ambres.cs.uni-freiburg.de.

Paper Structure

This paper contains 11 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: AmbResVLM reasons about the provided task description and grounds it in the observed scene. It then predicts whether the task is ambiguous and generates the appropriate query to the human user to disambiguate it. Finally, it interprets the user clarification, grounding the information again in the observed scene and enabling successful task execution.
  • Figure 2: Illustration of the reasoning process in AmbResVLM. The model is provided with a task description in natural language and an image observation of the scene. The first reasoning step consists of grounding the task-relevant objects in the visual observation and classifying whether the task-image pair is ambiguous or not. If the task is deemed ambiguous, a user query is generated. The second reasoning step consists of interpreting the feedback from the user, grounding the now unambiguous task objects, and predicting their locations in the image. Optionally, the final step consists of using the predicted coordinates to prompt the Segment Anything model (Ravi et al., 2024) to predict object masks.
  • Figure 3: Sample images from our dataset. Left: from the simulation environment. Right: from the real-world data.
  • Figure 4: Real-world policy learning tasks. Top: Place the (blue) cup on the tray. Middle: Place the (blue) bowl next to the mug. Bottom: Put the bread (on the plate) on the baking tray. Each task is evaluated under four conditions. Left: an unambiguous setup. Middle: color ambiguity, with additional clutter objects of the same color. Right: instance-level ambiguity, with multiple objects of the same class. For instance-level ambiguity, we distinguish two cases. In the first, additional object instances are present but we still request the original objects, i.e., the additional instances serve only as distractors. In the second, we also request a different object, e.g., the yellow cup instead of the blue cup.
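
The two-step reasoning process described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration only, not the authors' implementation: the JSON field names ("objects", "ambiguous", "query"), the `model` callable, and the helpers `parse_grounding` and `resolve` are all hypothetical, and the vision-language model is replaced by a stub that returns canned JSON strings.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

Point = tuple  # an (x, y) image coordinate


@dataclass
class Grounding:
    objects: dict            # object name -> list of candidate (x, y) points
    ambiguous: bool          # whether the task-image pair was judged ambiguous
    query: Optional[str] = None  # clarifying question, set only if ambiguous


def parse_grounding(raw: str) -> Grounding:
    """Parse one structured JSON response (hypothetical schema)."""
    data = json.loads(raw)
    objects = {name: [tuple(p) for p in pts]
               for name, pts in data["objects"].items()}
    ambiguous = bool(data["ambiguous"])
    query = data.get("query")
    if ambiguous and not query:
        raise ValueError("an ambiguous task needs a clarifying query")
    return Grounding(objects, ambiguous, query)


def resolve(task: str, image, model: Callable, ask_user: Callable) -> dict:
    """Ground the task, ask the user once if needed, and re-ground.

    `model(task, image, clarification)` is an assumed interface that returns
    a JSON string; `clarification` is None on the first reasoning step.
    """
    result = parse_grounding(model(task, image, None))
    if result.ambiguous:
        answer = ask_user(result.query)
        result = parse_grounding(model(task, image, answer))
        if result.ambiguous:
            raise RuntimeError("task is still ambiguous after clarification")
    # One point per object, e.g. to prompt a segmentation model downstream.
    return {name: pts[0] for name, pts in result.objects.items()}


def fake_model(task, image, clarification):
    """Stub standing in for the vision-language model."""
    if clarification is None:
        return json.dumps({"objects": {"cup": [[10, 20], [30, 40]]},
                           "ambiguous": True,
                           "query": "Which cup do you mean?"})
    return json.dumps({"objects": {"cup": [[30, 40]]}, "ambiguous": False})


points = resolve("pick up the cup", None, fake_model, lambda q: "the yellow one")
print(points)
```

Keeping the response in a fixed JSON shape means the same parser handles both reasoning steps, and the downstream policy only ever receives coordinates for tasks that are no longer ambiguous.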