MambaFoley: Foley Sound Generation using Selective State-Space Models

Marco Furio Colombo, Francesca Ronchini, Luca Comanducci, Fabio Antonacci

TL;DR

This paper tackles Foley sound synthesis by addressing temporal control within diffusion-based generation. It proposes MambaFoley, which embeds the selective State-Space Model Mamba into the bottleneck of a DAG-based diffusion U-Net and uses RMS-based temporal conditioning via BlockFiLM to steer the temporal evolution of sounds. Through experiments on the DCASE Foley dataset, it compares against T-Foley and AttentionFoley using objective metrics (FAD and E-L1) and subjective MOS judgments, reporting faster inference and improved FAD alongside strong perceptual quality. The results support the viability of selective SSMs for high-fidelity, temporally coherent Foley generation and point to further optimization opportunities in both architecture and conditioning strategies.
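The RMS temporal conditioning and the E-L1 metric mentioned above can be illustrated with a small sketch. It is not the authors' code: the frame length, hop size, and zero-padding are illustrative assumptions, and E-L1 is taken here to mean the average absolute difference between the frame-wise RMS envelopes of a target and a generated waveform. The function names `rms_envelope` and `e_l1` are hypothetical. Only the standard library is used so the example runs anywhere; an array library would be the usual choice in practice.

```python
import math


def rms_envelope(x, frame_len=512, hop=128):
    """Frame-wise root-mean-square envelope of a mono waveform.

    frame_len and hop are assumed values, not taken from the paper.
    """
    # Zero-pad both ends so frames are roughly centred on their hop position.
    pad = frame_len // 2
    padded = [0.0] * pad + [float(v) for v in x] + [0.0] * pad
    n_frames = max(0, 1 + (len(padded) - frame_len) // hop)
    env = []
    for i in range(n_frames):
        frame = padded[i * hop : i * hop + frame_len]
        env.append(math.sqrt(sum(v * v for v in frame) / frame_len))
    return env


def e_l1(target, generated, frame_len=512, hop=128):
    """Mean absolute difference between two RMS envelopes (assumed E-L1 form)."""
    e_t = rms_envelope(target, frame_len, hop)
    e_g = rms_envelope(generated, frame_len, hop)
    n = min(len(e_t), len(e_g))  # compare only the overlapping frames
    if n == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(e_t[:n], e_g[:n])) / n


if __name__ == "__main__":
    # Toy signals: a decaying 440 Hz burst and a quieter copy of it.
    sr = 16000
    target = [math.exp(-6.0 * t / sr) * math.sin(2 * math.pi * 440 * t / sr)
              for t in range(sr // 4)]
    quieter = [0.5 * v for v in target]
    print("E-L1 (identical):", e_l1(target, target))
    print("E-L1 (half gain):", e_l1(target, quieter))
```

Under these assumptions, a lower E-L1 means the generated sound follows the requested loudness profile more closely, while FAD and MOS assess overall audio quality instead.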

Abstract

Recent advances in deep learning have led to the widespread use of audio generation techniques, notably Denoising Diffusion Probabilistic Models (DDPMs), across various tasks. Among these, Foley sound synthesis is of particular interest for its role in multimedia content creation. Because sound is inherently time-dependent, generative models must handle the sequential modeling of audio samples effectively. Selective State Space Models (SSMs) have recently been proposed as an alternative to earlier techniques, demonstrating competitive performance with lower computational complexity. In this paper, we introduce MambaFoley, a diffusion-based model that, to the best of our knowledge, is the first to leverage the recently proposed SSM known as Mamba for Foley sound generation. To evaluate the proposed method, we compare it with a state-of-the-art Foley sound generative model using both objective and subjective analyses.

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Schematic representation of MambaFoley inference procedure. The model generates a raw audio waveform based on the desired input category and temporal profile.
  • Figure 2: Schematic representation of the U-Net architecture used during the backward part of the diffusion process.
  • Figure 3: Layers of MambaFoley: GBlock (a) and bidirectional Mamba bottleneck (b). $\mathbf{f}_{in}$ and $\mathbf{f}_{out}$ denote the generic input and output features of each layer, respectively.
  • Figure 4: Example spectrograms (c-e) of samples generated using different models. The category conditioning corresponds to the class gunshot, while the temporal conditioning is provided via the RMS shown in (a) and computed over the ground truth shown in (b).