DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Shota Nakada; Taichi Nishimura; Hokuto Munakata; Masayoshi Kondo; Tatsuya Komatsu

TL;DR

DETECLAP addresses the gap in fine-grained object understanding in audio-visual representation learning by augmenting CAV-MAE with an audio-visual label prediction loss. It automatically derives object labels from audio with CLAP and from visuals with YOLOv8, then merges these labels via AND/OR strategies to supervise the model. The approach yields consistent gains in retrieval and classification on VGGSound and AudioSet20K, highlighting the value of object-aware, cross-modal supervision. This object-centric enhancement advances multimodal learning by bridging auditory and visual object representations for more accurate cross-modal retrieval and recognition.
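The AND/OR merging step can be illustrated with a minimal sketch. It assumes each modality yields a per-label score (for example, a CLAP similarity for audio and a YOLOv8 detection confidence for visuals) that is thresholded into a multi-hot vector. The label vocabulary, the scores, the threshold values, and the helper names `to_multi_hot` and `merge_labels` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical shared label vocabulary (the paper's actual label set is not given here).
LABELS = ["dog", "cat", "flute", "guitar"]


def to_multi_hot(scores: Dict[str, float], threshold: float) -> List[int]:
    """Mark a label as present when its score reaches the threshold."""
    return [1 if scores.get(label, 0.0) >= threshold else 0 for label in LABELS]


def merge_labels(audio: List[int], visual: List[int], strategy: str) -> List[int]:
    """Merge two multi-hot vectors element-wise with an AND or OR strategy."""
    if strategy == "and":
        return [a & v for a, v in zip(audio, visual)]
    if strategy == "or":
        return [a | v for a, v in zip(audio, visual)]
    raise ValueError(f"unknown strategy: {strategy}")


# Illustrative scores only; real values would come from the audio and visual models.
audio_scores = {"dog": 0.8, "flute": 0.1, "guitar": 0.6}
visual_scores = {"dog": 0.9, "cat": 0.7, "guitar": 0.2}

audio_labels = to_multi_hot(audio_scores, threshold=0.5)
visual_labels = to_multi_hot(visual_scores, threshold=0.5)

and_labels = merge_labels(audio_labels, visual_labels, "and")
or_labels = merge_labels(audio_labels, visual_labels, "or")
```

In this toy case, AND keeps only labels that both modalities agree on, while OR keeps any label that either modality reports, so OR gives broader but potentially noisier supervision.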

Abstract

Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it cannot recognize fine-grained categories within them, such as "dogs" and "flutes". To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to add an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder (CAV-MAE) to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
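For readers unfamiliar with the retrieval metric, recall@k can be sketched as follows. This is a generic illustration, not the paper's evaluation code: it assumes a square similarity matrix in which the correct match for query `i` is candidate `i`, and the function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Fraction of queries whose true match (assumed to be index i for
    query i) appears among the k highest-scoring candidates."""
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy audio-to-visual similarity matrix: rows are audio queries,
# columns are visual candidates (values are illustrative only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

r_at_1 = recall_at_k(sim, 1)
r_at_2 = recall_at_k(sim, 2)
```

Transposing the matrix gives the opposite retrieval direction (visual-to-audio) under the same assumption.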

Paper Structure (13 sections, 7 equations, 4 figures, 3 tables)

Introduction
Preliminary: CAV-MAE
Method
Audio-visual label prediction loss
Acquiring audio labels using CLAP
Acquiring visual labels using object detectors
Merging strategies
Experiments
Experimental settings
Results on audio-visual retrieval
Results on audio-visual classification
Comparative Study
Conclusion

Figures (4)

Figure 1: Failure cases of CAV-MAE in audio-visual retrieval. Although CAV-MAE captures rough categories (e.g., "animals" and "wind instruments"), it lacks fine-grained object information ("dogs" and "flute").
Figure 2: Overview of the proposed method, DETECLAP. To enhance CAV-MAE with object information, we apply CLAP and an object detector to the videos in the dataset, thereby acquiring audio-visual labels. Based on these labels, we train CAV-MAE with an audio-visual label prediction loss.
Figure 3: Sensitivity of performance to the thresholds used during audio-visual label generation. The left panel shows how performance changes as the CLAP threshold is adjusted, and the right panel shows the same for the YOLOv8 threshold.
Figure 4: Retrieved visual/audio from input audio/visual queries.
