MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

Trung X. Pham, Tri Ton, Chang D. Yoo

TL;DR

MDSGen tackles vision-guided open-domain sound generation by replacing heavy Unet backbones with a lightweight denoising masked diffusion Transformer conditioned on a learned video representation. Its two key innovations are a Reducer that removes redundant video features and a Temporal-Aware Masking (TAM) strategy that exploits the temporal structure of audio, enabling strong alignment with far fewer parameters. On VGGSound, a 5M-parameter Tiny model achieves 97.9% alignment accuracy with 172× fewer parameters and 36× faster inference than the 860M-parameter state-of-the-art baseline, while a 131M-parameter Base model reaches nearly 99% alignment accuracy, demonstrating scalability. The method also generalizes to Flickr-SoundNet with competitive cross-modal metrics, and ablations validate the benefits of TAM, the Reducer, and the guidance strategies, which matters in practice for fast, resource-efficient video-to-audio generation.

Abstract

We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and offering 36× faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at https://bit.ly/mdsgen.
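
The parameter-reduction factors quoted in the abstract can be checked with simple arithmetic. The sketch below assumes the rounded parameter counts given above (860M, 131M, 5M); the constant and function names are illustrative only and do not come from the MDSGen code base.

```python
# Parameter counts in millions, as quoted (rounded) in the abstract.
BASELINE_PARAMS_M = 860  # current state-of-the-art baseline
BASE_PARAMS_M = 131      # larger MDSGen model
TINY_PARAMS_M = 5        # smallest MDSGen model


def reduction_factor(baseline: float, model: float) -> float:
    """Return how many times fewer parameters `model` has than `baseline`."""
    return baseline / model


tiny_factor = reduction_factor(BASELINE_PARAMS_M, TINY_PARAMS_M)
base_factor = reduction_factor(BASELINE_PARAMS_M, BASE_PARAMS_M)

print(f"Tiny: {tiny_factor:.0f}x fewer parameters")
print(f"Base: {base_factor:.2f}x fewer parameters")
```

With these rounded counts the Tiny ratio comes out at exactly 172, and the Base ratio at roughly 6.56, which the abstract reports as 6.5×. The memory and inference-speed figures depend on measurements that are not reproduced in this summary, so they are not recomputed here.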

Paper Structure (40 sections, 1 equation, 18 figures, 17 tables)

This paper contains 40 sections, 1 equation, 18 figures, 17 tables.

Figures (18)

  • Figure 1: Alignment Score. Comparison with SOTA audio generation methods on the VGGSound test set. The diameter of each circle represents the memory usage during inference.
  • Figure 2: Overview of the proposed highly efficient MDSGen framework, which uses denoising masked diffusion transformers to learn video-conditional distributions for audio generation, replacing traditional Unet-based methods. The fire icon denotes trainable modules, and the lock icon denotes frozen ones. Green arrows $\color{green}\rightarrow$ denote branches used only during training, blue arrows $\color{blue}\rightarrow$ denote branches used only during inference, and black arrows $\rightarrow$ denote branches used in both training and inference.
  • Figure 3: Audio Masking Strategies. Here, the red square is the learnable mask token.
  • Figure 4: Confidence Scores. Compared to FoleyCrafter (left) and Diff-Foley (middle), our method (right) produces many more audio samples with higher confidence that align with their corresponding videos on the VGGSound test set ($\sim 15$k samples).
  • Figure 5: Learned Weights of Reducer. Comparison of our three models.
  • ...and 13 more figures