You Only Speak Once to See

Wenhao Yang; Jianguo Wei; Wenhuan Lu; Lei Li

TL;DR

The paper tackles audio-guided visual grounding, proposing YOSS (You Only Speak Once to See), which learns a shared embedding space for images and audio and uses a YOLOv8-based cross-modal query for grounding. It combines a CLIP-style image encoder with a HuBERT-based audio encoder and an aggregation branch, trained with a two-stage objective: $L_1 = L_{Con}(x_i, x_a) + \eta L_{Align}(x_t, x_a)$ for pretraining and $L_2 = L_{cls}(x_i, x_a) + L_{loc}(x_i, x_a)$ for grounding, with the encoders frozen during grounding. The study shows that audio grounding is feasible and improves with alignment, evaluated on LVIS and COCO with synthesized and human speech data. It highlights the potential for open-vocabulary grounding and robotics while noting a performance gap relative to text-based grounding. By incorporating spoken language into grounding frameworks, the work points toward more natural multi-modal interfaces and more robust scene understanding, and lays a foundation for future improvements in audio-guided perception.
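To make the pretraining objective concrete, the following is a minimal sketch of a loss with the same shape: a contrastive term between image and audio embeddings plus an η-weighted alignment term between text and audio embeddings. This page does not give the exact forms of $L_{Con}$ and $L_{Align}$, so the sketch assumes a symmetric InfoNCE loss for the contrastive term and a mean cosine distance for the alignment term. Reading $x_i$, $x_t$, $x_a$ as image, text, and audio embeddings is also an assumption, and the function names, the temperature value, and the default `eta` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def contrastive_loss(img, aud, temperature=0.07):
    """Assumed form of L_Con: symmetric InfoNCE over a batch of matched pairs."""
    img = [l2_normalize(v) for v in img]
    aud = [l2_normalize(v) for v in aud]
    logits = [[dot(i, a) / temperature for a in aud] for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def alignment_loss(txt, aud):
    """Assumed form of L_Align: mean cosine distance between paired embeddings."""
    total = 0.0
    for t, a in zip(txt, aud):
        total += 1.0 - dot(l2_normalize(t), l2_normalize(a))
    return total / len(txt)


def pretraining_loss(img, txt, aud, eta=1.0, temperature=0.07):
    """L_1 = L_Con(x_i, x_a) + eta * L_Align(x_t, x_a), under the assumptions above."""
    return contrastive_loss(img, aud, temperature) + eta * alignment_loss(txt, aud)
```

With a batch in which row k of each modality describes the same object, the loss is small when matched image and audio embeddings point in the same direction and grows when the pairing is shuffled. Setting `eta=0.0` drops the alignment term and leaves only the contrastive part.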

Abstract

Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.

Paper Structure (18 sections, 7 equations, 4 figures, 4 tables)

Introduction
Related Work
Visual Grounding and Open-Vocabulary Object Detection
Speech and Audio-Visual Alignment
Self-Supervised Speech Models and Multimodal Integration
Methodology
Audio-Visual Feature Extraction
Audio-Visual Cross-Modal Query
Unified Framework
Experiment
Settings
Dataset
Implementation Details
Results
Ablation Studies
...and 3 more sections

Figures (4)

Figure 1: Model predictions on the COCO classes with YOSS.
Figure 2: The YOSS framework for the proposed Audio Grounding task.
Figure 3: The Contrastive and Alignment Learning of Audio-Visual Grounding.
Figure 4: Text annotation for speech utterances with timestamps from the Whisper model.
