Decoupled Audio-Visual Dataset Distillation

Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

TL;DR

This work targets efficient audio–visual dataset distillation by addressing two problems: the inability of Distribution Matching (DM) to capture cross-modal alignment, and the instability caused by jointly optimizing cross-modal objectives. It introduces DAVDD, a decoupled framework that uses a diverse pre-trained model bank and a lightweight decoupler bank to split features into common (shared) and private (modality-specific) components. Common Intermodal Matching and Sample–Distribution Joint Alignment then preserve cross-modal structure while safeguarding private cues. The training objective decouples private and common learning via targeted losses, supplemented by inter-sample and distribution-level alignment and an EMA prototype bank for global consistency. Empirically, DAVDD achieves state-of-the-art results across VGGS-10K, MUSIC-21, and AVE under various IPC settings, demonstrating improved cross-modal fidelity, robustness across architectures, and strong potential for scalable AV dataset distillation.

Abstract

Audio-Visual Dataset Distillation aims to compress large-scale datasets into compact subsets while preserving the performance of the original data. However, conventional Distribution Matching (DM) methods struggle to capture intrinsic cross-modal alignment. Subsequent studies have attempted to introduce cross-modal matching, but two major challenges remain: (i) independently and randomly initialized encoders lead to inconsistent modality mapping spaces, increasing training difficulty; and (ii) direct interactions between modalities tend to damage modality-specific (private) information, thereby degrading the quality of the distilled data. To address these challenges, we propose DAVDD, a pretraining-based decoupled audio-visual distillation framework. DAVDD leverages a diverse pretrained bank to obtain stable modality features and uses a lightweight decoupler bank to disentangle them into common and private representations. To effectively preserve cross-modal structure, we further introduce Common Intermodal Matching together with a Sample-Distribution Joint Alignment strategy, ensuring that shared representations are aligned both at the sample level and the global distribution level. Meanwhile, private representations are entirely isolated from cross-modal interaction, safeguarding modality-specific cues throughout distillation. Extensive experiments across multiple benchmarks show that DAVDD achieves state-of-the-art results under all IPC settings, demonstrating the effectiveness of decoupled representation learning for high-quality audio-visual dataset distillation. Code will be released.
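The decoupling idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear decoupler, the mean-matching form of the DM loss, the feature dimensions, and all names (`LinearDecoupler`, `mean_matching_loss`, `decoupled_loss`) are assumptions made for illustration, and random arrays stand in for the outputs of pretrained encoders. It only shows the routing the abstract describes: private features are matched within their own modality, while common features are additionally matched across modalities.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_matching_loss(real_feats, syn_feats):
    """Simplest form of distribution matching: squared distance between feature means."""
    diff = real_feats.mean(axis=0) - syn_feats.mean(axis=0)
    return float(np.sum(diff ** 2))


class LinearDecoupler:
    """Hypothetical decoupler: two linear heads producing common and private features."""

    def __init__(self, in_dim, out_dim, rng):
        scale = 1.0 / np.sqrt(in_dim)
        self.w_common = rng.normal(0.0, scale, size=(in_dim, out_dim))
        self.w_private = rng.normal(0.0, scale, size=(in_dim, out_dim))

    def __call__(self, feats):
        # Returns (common, private), each of shape (n_samples, out_dim).
        return feats @ self.w_common, feats @ self.w_private


def decoupled_loss(real_a, real_v, syn_a, syn_v, dec_a, dec_v):
    """Return (private_loss, common_loss) for audio (a) and visual (v) features."""
    ra_c, ra_p = dec_a(real_a)
    rv_c, rv_p = dec_v(real_v)
    sa_c, sa_p = dec_a(syn_a)
    sv_c, sv_p = dec_v(syn_v)

    # Private features never interact across modalities.
    private = mean_matching_loss(ra_p, sa_p) + mean_matching_loss(rv_p, sv_p)

    # Common features are matched within each modality and across modalities
    # (assumed form of cross-modal matching, chosen for illustration).
    common = (
        mean_matching_loss(ra_c, sa_c)
        + mean_matching_loss(rv_c, sv_c)
        + mean_matching_loss(ra_c, sv_c)
        + mean_matching_loss(rv_c, sa_c)
    )
    return private, common


# Stand-ins for pretrained encoder outputs (dimensions are arbitrary assumptions).
real_a = rng.normal(size=(32, 128))   # real audio features
real_v = rng.normal(size=(32, 512))   # real visual features
syn_a = rng.normal(size=(4, 128))     # synthetic (distilled) audio features
syn_v = rng.normal(size=(4, 512))     # synthetic (distilled) visual features

dec_a = LinearDecoupler(128, 16, rng)
dec_v = LinearDecoupler(512, 16, rng)

private_loss, common_loss = decoupled_loss(real_a, real_v, syn_a, syn_v, dec_a, dec_v)
print(private_loss, common_loss)
```

In this sketch, changing the synthetic visual features leaves the audio private term untouched, which is the sense in which private representations are isolated from cross-modal interaction.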

Paper Structure

This paper contains 25 sections, 30 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A comparison of architectures for multimodal audio-visual dataset distillation. Subfigures (a), (b), and (c) depict the frameworks of Distribution Matching (DM), Audio-Visual Dataset Distillation (AVDD), and the proposed Decoupled Audio-Visual Dataset Distillation (DAVDD), respectively. Subfigure (d) reports performance on the VGGS-10K dataset under IPC 1 and IPC 10 settings, where DAVDD consistently surpasses prior methods in both accuracy and variance.
  • Figure 2: Overview of the DAVDD framework. DAVDD consists of three key components: a pre-trained model bank providing stable and diverse audio–visual encoders, a decoupler bank that projects encoder outputs into shared and private representation spaces, and a decoupled distillation process that performs intra-modal matching on private features and cross-modal alignment on shared features. By combining Sample–Distribution Joint Alignment with Common Intermodal Matching, DAVDD preserves modality-specific information while effectively capturing audio–visual correlations, enabling the synthesis of high-fidelity multimodal datasets.
  • Figure 3: The figure shows the synthesized samples for the class "chicken crowing" from the VGGS-10K dataset with IPC = 10. The top row shows the distilled visual data, while the bottom row shows the corresponding visualized spectrograms of the distilled audio.
  • Figure 4: Losses of AVDD and DAVDD during the distillation process.
  • Figure 5: The figure presents the visualization of a sample image from the VGGS-10K dataset at IPC = 10 using different methods. (a) shows the original image, (b) shows the distilled image generated by the DM method, and (c) shows the distilled image generated by our DAVDD method.
  • ...and 3 more figures