You Only Speak Once to See
Wenhao Yang, Jianguo Wei, Wenhuan Lu, Lei Li
TL;DR
The paper tackles audio grounding: locating objects in an image from a spoken description. It proposes YOSS (You Only Speak Once to See), which learns a shared embedding space for images and audio and uses a YOLOv8-based cross-modal query for grounding. The model pairs a CLIP-style image encoder with a HuBERT-based audio encoder that has an aggregation branch, and it is trained in two stages: pretraining with $L_1 = L_{Con}(x_i, x_a) + \eta L_{Align}(x_t, x_a)$, then grounding with $L_2 = L_{cls}(x_i, x_a) + L_{loc}(x_i, x_a)$, where $x_i$, $x_a$, and $x_t$ denote the image, audio, and text modalities and the encoders are frozen during grounding. Experiments on LVIS and COCO with synthesized and human speech indicate that audio grounding is feasible and improves with alignment, although a performance gap to text-based grounding remains. The work points to open-vocabulary grounding and robotics as applications, and to spoken-language grounding as a foundation for more natural multi-modal interfaces and more robust scene understanding.
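To make the structure of the pretraining objective $L_1$ concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the summary does not specify the form of either term, so the sketch assumes a symmetric InfoNCE-style contrastive loss for $L_{Con}$, a mean cosine distance for $L_{Align}$, and a temperature of 0.07. The function names (`contrastive_loss`, `align_loss`, `stage1_loss`) are hypothetical, and the inputs are assumed to be already-computed, nonzero embedding vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(img, aud, temperature=0.07):
    """ASSUMED form of L_Con: symmetric InfoNCE over a batch of pairs.

    img[k] and aud[k] are treated as the matching image/audio pair;
    every other item in the batch acts as a negative.
    """
    n = len(img)
    sim = [[cosine(i, a) / temperature for a in aud] for i in img]
    img_to_aud = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    aud_to_img = sum(
        cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (img_to_aud + aud_to_img)


def align_loss(txt, aud):
    """ASSUMED form of L_Align: mean cosine distance between paired
    text and audio embeddings."""
    return sum(1.0 - cosine(t, a) for t, a in zip(txt, aud)) / len(txt)


def stage1_loss(img, txt, aud, eta=1.0, temperature=0.07):
    """L_1 = L_Con(x_i, x_a) + eta * L_Align(x_t, x_a)."""
    return contrastive_loss(img, aud, temperature) + eta * align_loss(txt, aud)


# Toy batch of two pairs with 2-D embeddings.
img = [[1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
aud_matched = [[1.0, 0.0], [0.0, 1.0]]
aud_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = stage1_loss(img, txt, aud_matched)
loss_swapped = stage1_loss(img, txt, aud_swapped)
```

In this toy batch the loss is close to zero when each audio embedding coincides with its paired image and text embeddings, and it grows when the audio embeddings are swapped. The grounding objective $L_2$ is not sketched here, because the summary names its classification and localization terms without describing them.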
Abstract
Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.
