Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation

Joel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda, Radu Timofte

TL;DR

The paper investigates grounding a target object in images using spoken language without transcription, introducing a dataset of images paired with clean, keyword-only audio and evaluating direct audio-visual grounding against transcription-based pipelines. By adapting audio-visual segmentation (AVS) models to a single-frame, short-audio setting and benchmarking with standard metrics, it shows that direct audio grounding can achieve comparable or superior accuracy with lower latency and fewer parameters. The authors perform qualitative analyses and ablations on fusion strategies, revealing modality-specific differences and the importance of cross-modal alignment design. Overall, the work advocates for end-to-end audio-visual grounding as a robust alternative to text-first pipelines, with practical implications for real-time robotics and multimodal understanding.

Abstract

Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely related audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.

Paper Structure

This paper contains 10 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Histogram of the gain. Direct speech grounding outperforms transcription-based grounding on 5881 of 9905 images ($\approx$ 60%). The models used for comparison are the best-performing representatives of their respective categories. Image grounding has traditionally been approached as a text-driven task, achieving remarkable success in recent years [34, 38, 39]. However, when combined with ASR for natural human–computer interaction, the pipeline can accumulate errors. In this work, we explore direct speech-driven grounding, bypassing intermediate text conversion, which improves robustness to subtle tonal variations often overlooked by text-based models.
  • Figure 2: Distribution of the visual dataset, composed of frames selected from the videos introduced in [7, 17, 26].
  • Figure 3: Qualitative results from our benchmark. Top rows: multi-object cases. Bottom row: accent variations.
  • Figure 4: Qualitative Results. Segmentation results under varied conditions, including multiple object classes, multiple instances of the same class, and single objects with phrasing variations. The middle columns show the outputs of transcription models, with the different intonations they captured, while the final column presents outputs from the direct speech-to-vision system.
  • Figure 5: Qualitative Results. Discrepancies arise between the input text or audio descriptions and the corresponding visual scenes, particularly when visually similar objects are present in the scene. The performance of transcription models is substantially reduced by interpretation errors, such as transcribing "sea lion" as the phrase "See you, Lion!" Such mistakes lead to significant inaccuracies in the segmentation.
  • ...and 2 more figures
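
The per-image gain summarized in Figure 1 can be illustrated with a minimal sketch. It assumes the gain is the difference in intersection-over-union (IoU) between the direct and the transcription-based prediction for the same image, and that an image counts as a win only when that difference is strictly positive; the summary above does not spell out either choice, so both are assumptions. The helper names `iou` and `gain_summary` and the toy masks are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixel labels


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly (a convention assumed here).
    return inter / union if union else 1.0


def gain_summary(
    direct_iou: Sequence[float], transcribed_iou: Sequence[float]
) -> Tuple[List[float], int, float]:
    """Per-image gain (direct minus transcribed), win count, win fraction."""
    gains = [d - t for d, t in zip(direct_iou, transcribed_iou)]
    wins = sum(1 for g in gains if g > 0)  # ties are not counted as wins
    return gains, wins, wins / len(gains)


# Toy masks (hypothetical data, not from the paper's dataset).
target = [[1, 1, 0, 0], [1, 1, 0, 0]]
direct_pred = [[1, 1, 0, 0], [1, 1, 0, 0]]       # exact match
transcribed_pred = [[1, 1, 1, 1], [1, 1, 1, 1]]  # over-segments the scene

gains, wins, fraction = gain_summary(
    [iou(direct_pred, target)], [iou(transcribed_pred, target)]
)

# The win fraction reported in the Figure 1 caption.
reported_fraction = 5881 / 9905
print(f"toy gain: {gains[0]:.2f}, reported win fraction: {reported_fraction:.3f}")
```

Plotting a histogram of `gains` over a full test set would give a figure of the same kind as Figure 1; the caption's 5881 of 9905 corresponds to a win fraction of about 0.594.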