Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation

Sicheng Yang, Yukai Huang, Weitong Cai, Shitong Sun, You He, Jiankang Deng, Hang Zhang, Jifei Song, Zhensong Zhang

TL;DR

The paper tackles multimodal intent ambiguity in egocentric interaction by introducing the Plug-and-Play Clarifier, a zero-shot, modular framework that decomposes ambiguity into text, vision, and cross-modal sub-tasks. It combines dialogue-driven clarification, real-time visual quality feedback, and precise 3D gesture grounding, and it plugs into existing foundation models without fine-tuning. Quantitative results show substantial gains for small LMs (≈30% on textual disambiguation), vision clarifier improvements (>20% in corrective guidance), and cross-modal grounding enhancements (~5% semantic accuracy), validated on the new VRA-Ego dataset and the established CLAMBER and IN3 datasets. The approach demonstrates that a hybrid architecture (LLMs guided by deterministic algorithms and structured iterative reasoning) offers a practical, efficient path toward reliable, embodied egocentric AI, with strong implications for AR/wearable interfaces and future robotic assistants.

Abstract

The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. This challenge arises from a combination of underspecified language, imperfect visual data, and deictic gestures, which frequently leads to task failure. Existing monolithic Vision-Language Models (VLMs) struggle to resolve these ambiguous multimodal inputs, often failing silently or hallucinating responses. To address these ambiguities, we introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Specifically, our framework consists of three synergistic modules: (1) a text clarifier that uses dialogue-driven reasoning to interactively disambiguate linguistic intent, (2) a vision clarifier that delivers real-time guidance feedback, instructing users to adjust their positioning for improved capture quality, and (3) a cross-modal clarifier with a grounding mechanism that robustly interprets 3D pointing gestures and identifies the specific objects users are pointing to. Extensive experiments demonstrate that our framework improves the intent clarification performance of small language models (4–8B) by approximately 30%, making them competitive with significantly larger counterparts. We also observe consistent gains when applying our framework to these larger models. Furthermore, our vision clarifier increases corrective guidance accuracy by over 20%, and our cross-modal clarifier improves semantic answer accuracy for referential grounding by 5%. Overall, our method provides a plug-and-play framework that effectively resolves multimodal ambiguity and significantly enhances user experience in egocentric interaction.
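To make the vision clarifier's guidance step (module 2 in the abstract) more concrete, the following is a minimal sketch of one possible framing check. It is not the authors' implementation: it assumes an object detector has already returned a pixel bounding box, and the names `Box`, `framing_feedback`, and the `margin` threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (origin at top-left)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def framing_feedback(box: Box, width: int, height: int,
                     margin: float = 0.02) -> List[str]:
    """Return camera hints when the detected box touches the frame border.

    Assumption: a box that reaches the border means the object is cut off,
    so the camera should move toward that side. `margin` is the fraction of
    the frame size treated as "touching" (a hypothetical default).
    """
    mx, my = margin * width, margin * height
    hints: List[str] = []
    if box.x_min <= mx:
        hints.append("move camera left")
    if box.x_max >= width - mx:
        hints.append("move camera right")
    if box.y_min <= my:
        hints.append("move camera upward")
    if box.y_max >= height - my:
        hints.append("move camera downward")
    return hints


# An object clipped at the top edge of a 640x480 frame.
print(framing_feedback(Box(100, 0, 300, 200), 640, 480))
```

An empty list means the object is fully inside the frame, at which point a separate clarity check (for example a blur measure) could run before the image is passed on to the VLM.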

Paper Structure

This paper contains 36 sections, 5 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Our Clarifier framework resolves the multimodal ambiguous query, "Is this a good gift?" (a) Multimodal Intent Ambiguity: A standard AI defaults to a guess, making assumptions about the recipient (e.g., a child), their interests, and the user's budget. This "black box" approach is unhelpful because if the assumptions are wrong, the recommendation is useless. (b) Plug-and-Play Clarifier: Our system avoids guessing. It first identifies that key information (recipient, occasion, budget) and visual context are missing, and that the pointing gesture is ambiguous across modalities. It then proactively asks clarifying questions and provides camera feedback ("For who? Move camera upward..."). Once the user provides the necessary context, the system can deliver a relevant and genuinely helpful recommendation.
  • Figure 2: An overview of our clarification pipeline, a plug-in module for resolving ambiguous multimodal queries. The pipeline identifies and addresses three types of underspecification: (1) semantic ambiguity in language (e.g., "a good gift") is clarified through dialogue; (2) visual ambiguity from unclear object views is handled by requesting a better view; and (3) referential ambiguity from pointing gestures (e.g., "this") is improved by adaptive image cropping.
  • Figure 3: Overview of our vision-based clarification module. Given a user's query about a physical object, the system first identifies the target class (e.g., "menu") using a VLM. An open-set detector then localizes the object in the image frame. Subsequently, the visual quality is assessed for framing integrity and clarity. If issues like improper framing or blurriness are detected, the system provides real-time corrective feedback to the user, ensuring high-quality visual input before proceeding.
  • Figure 4: Our multi-stage pipeline for resolving cross-modal referential ambiguity. From a single image, we (1) estimate a 3D pointing ray from the user's hand gesture, (2) cast this ray into the scene to find a 3D intersection point, and (3) identify the target object and generate a context-aware crop containing both the hand and the object, which is then passed to a VLM for final interpretation.
  • Figure 5: A comparative analysis of our proposed framework against the baseline across three open-source LLM families: (a) Qwen2.5, (b) Qwen3, and (c) Llama-3.1. The evaluation spans a range of model sizes to assess performance scalability. The primary performance metric, Recover Rate (representing the recovery of critical missing details), is shown using bar charts for both our method (blue) and the baseline (red). Additionally, the Accuracy (Baseline) is plotted as a dashed red line, while the Average Conversation Rounds (denoted by 'r' values) are annotated above each bar to measure dialogue efficiency.
  • ...and 4 more figures
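The ray-casting idea in the Figure 4 caption (estimate a pointing ray, cast it into the scene, pick the object it hits) can be illustrated with a short sketch. This is a simplification under stated assumptions, not the paper's pipeline: the scene is replaced by hypothetical axis-aligned 3D boxes, the ray is taken as already estimated, and the intersection uses the standard slab test. The names `ray_box_distance` and `pointed_object` are illustrative.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def ray_box_distance(origin: Vec3, direction: Vec3,
                     box_min: Vec3, box_max: Vec3) -> Optional[float]:
    """Slab test: distance along the ray to an axis-aligned box, None on a miss."""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-9:
            # Ray is parallel to this pair of slabs: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None  # The slab intervals do not overlap: no intersection.
    return t_near


def pointed_object(origin: Vec3, direction: Vec3,
                   boxes: Dict[str, Tuple[Vec3, Vec3]]) -> Optional[str]:
    """Return the name of the nearest box hit by the ray, or None if none is hit."""
    hits = {}
    for name, (lo, hi) in boxes.items():
        t = ray_box_distance(origin, direction, lo, hi)
        if t is not None:
            hits[name] = t
    return min(hits, key=hits.get) if hits else None


# Hypothetical scene (camera frame, metres): the ray points straight ahead (+z).
scene = {
    "cup": ((-0.1, -0.1, 1.0), (0.1, 0.1, 1.2)),
    "book": ((-0.2, -0.2, 2.0), (0.2, 0.2, 2.1)),
    "lamp": ((1.0, 1.0, 1.0), (1.2, 1.2, 1.2)),
}
print(pointed_object((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))
```

Choosing the nearest hit reflects the intuition that a user points at the first object along the ray. A context-aware crop around the hand and the selected object, as the caption describes, would be a separate step.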