You Only Speak Once to See

Wenhao Yang, Jianguo Wei, Wenhuan Lu, Lei Li

TL;DR

The paper tackles Audio Grounding, the task of locating objects in an image from a spoken query, and proposes YOSS (You Only Speak Once to See), which learns a shared embedding space for images and audio and uses a YOLOv8-based cross-modal query for grounding. It combines a CLIP-style image encoder with a HuBERT-based audio encoder and an aggregation branch, trained with a two-stage objective: $L_1 = L_{Con}(x_i, x_a) + \eta L_{Align}(x_t, x_a)$ for pretraining and $L_2 = L_{cls}(x_i, x_a) + L_{loc}(x_i, x_a)$ for grounding, with the encoders frozen during the grounding stage. Evaluated on LVIS and COCO with both synthesized and human speech, the study shows that audio grounding is feasible and improves with alignment, while a performance gap to text-based grounding remains. The authors highlight potential applications in open-vocabulary grounding and robotics, and position spoken-language grounding as a basis for more natural multi-modal interfaces and future work on audio-guided perception.
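
To make the shape of the pretraining objective concrete, here is a minimal sketch of how a contrastive term and a weighted alignment term might be combined. It is not the authors' implementation. Several things are assumptions: reading $x_i$, $x_a$, and $x_t$ as image, audio, and text embeddings; using a symmetric InfoNCE loss for $L_{Con}$; using mean cosine distance for $L_{Align}$; and the function names, the temperature, and the default weight `eta`, none of which are specified in this summary.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[k] is the positive for keys[k]."""
    total = 0.0
    for k, q in enumerate(queries):
        logits = [dot(q, key) / temperature for key in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[k]
    return total / len(queries)


def contrastive_loss(x_i, x_a, temperature=0.07):
    """Assumed form of L_Con: symmetric InfoNCE over image/audio pairs."""
    x_i = [l2_normalize(v) for v in x_i]
    x_a = [l2_normalize(v) for v in x_a]
    return 0.5 * (info_nce(x_i, x_a, temperature)
                  + info_nce(x_a, x_i, temperature))


def alignment_loss(x_t, x_a):
    """Assumed form of L_Align: mean cosine distance of text/audio pairs."""
    dists = [1.0 - dot(l2_normalize(t), l2_normalize(a))
             for t, a in zip(x_t, x_a)]
    return sum(dists) / len(dists)


def stage1_loss(x_i, x_a, x_t, eta=1.0):
    """L_1 = L_Con(x_i, x_a) + eta * L_Align(x_t, x_a)."""
    return contrastive_loss(x_i, x_a) + eta * alignment_loss(x_t, x_a)


# Toy batch of two pairs: matched embeddings give a much lower loss
# than the same embeddings with the audio rows swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = stage1_loss(images, audio_matched, texts)
loss_swapped = stage1_loss(images, audio_swapped, texts)
```

In this sketch, setting `eta=0` reduces the objective to the contrastive term alone, which is one way to see what the alignment term contributes.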

Abstract

Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Model predictions on the COCO classes with YOSS.
  • Figure 2: The YOSS framework for the proposed Audio Grounding task.
  • Figure 3: The Contrastive and Alignment Learning of Audio-Visual Grounding.
  • Figure 4: Text annotation for speech utterances with timestamps from the Whisper model.