Few-shot Acoustic Synthesis with Multimodal Flow Matching

Amandine Brunetto

Abstract

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches scale better across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC uses a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. With a single shot, FLAC outperforms state-of-the-art eight-shot baselines on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.

Paper Structure (89 sections, 19 equations, 16 figures, 15 tables)

Figures (16)

  • Figure 1: Few-shot flow-matching acoustic synthesis (FLAC) and scene-consistency evaluation: Given a few-shot multimodal context $\tau$, including a depth map, an acoustic observation, and sensor poses, FLAC uses a diffusion transformer trained with flow matching to generate room impulse responses (RIRs) in novel rooms. Unlike prior deterministic approaches, FLAC models the distribution of plausible RIRs under sparse scene context, capturing acoustic uncertainty. Even with one shot, FLAC outperforms 8-shot state-of-the-art methods. To assess generation quality, we introduce AGREE, a CLIP-style audio-geometry embedding that aligns both modalities in a shared latent space, enabling scene-consistency evaluation through retrieval and distributional metrics.
  • Figure 2: Training and inference pipelines of FLAC: During training, a pre-trained VAE encodes ground-truth RIRs into latents $\mathbf{z}_0$. Latents are linearly interpolated with noise to form $\mathbf{z}_t$. A DiT is trained to predict the velocity $\widehat{\mathbf{v}}_t$ that transports $\mathbf{z}_t$ toward the original data distribution. At inference, RIRs are generated from random noise, guided by the few-shot spatial, geometric and acoustic context.
  • Figure 3: FLAC diffusion transformer: The noise timestep $t$ and the target RIR pose are injected via AdaLN. Acoustic, spatial and geometric context are provided through cross-attention.
  • Figure 4: AGREE contrastive framework: Audio and geometry inputs are encoded into a shared latent space, where a contrastive objective maximizes similarity for matching pairs (diagonal entries) and minimizes it for mismatched ones.
  • Figure 5: Impact of classifier-free guidance and inference steps: Evolution of T60 and FD$_G$ as a function of the guidance scale $\omega$ and the number of timesteps.
  • ...and 11 more figures
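
The training and sampling steps summarized in Figures 2 and 5 can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. The time convention ($t=0$ for data, $t=1$ for noise), the target velocity $\mathbf{z}_0 - \boldsymbol{\epsilon}$, the Euler integrator, the standard classifier-free guidance formula, and all function names are assumptions made for the example. In the paper's setting the velocity would come from the conditioned DiT, which is replaced here by a placeholder callable.

```python
import numpy as np

def interpolate(z0, eps, t):
    """Linear path between a data latent z0 (t=0) and noise eps (t=1). Assumed convention."""
    t = np.asarray(t, dtype=z0.dtype).reshape(-1, *([1] * (z0.ndim - 1)))
    return (1.0 - t) * z0 + t * eps

def target_velocity(z0, eps):
    """Velocity pointing from noise toward data along the linear path (assumed sign)."""
    return z0 - eps

def flow_matching_loss(v_pred, z0, eps):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - target_velocity(z0, eps)) ** 2))

def guided_velocity(v_cond, v_uncond, omega):
    """Standard classifier-free guidance: omega=0 is unconditional, omega=1 is conditional."""
    return v_uncond + omega * (v_cond - v_uncond)

def euler_sample(velocity_fn, eps, num_steps=50):
    """Integrate from noise (t=1) toward data (t=0) with fixed-size Euler steps."""
    z = eps.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z + dt * velocity_fn(z, t)
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.standard_normal((2, 8))   # stand-in for VAE latents of ground-truth RIRs
    eps = rng.standard_normal((2, 8))  # Gaussian noise
    zt = interpolate(z0, eps, t=[0.3, 0.7])
    # A zero prediction stands in for the model output during training.
    print("loss for a zero prediction:", flow_matching_loss(np.zeros_like(zt), z0, eps))
```

With an oracle that returns the exact target velocity, `euler_sample` recovers `z0` from `eps`, which is a quick sanity check that the interpolation and the integration direction agree.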
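
The contrastive objective described for AGREE in Figure 4 follows the general CLIP pattern, which can be illustrated with a symmetric cross-entropy over a similarity matrix. The sketch below is a generic CLIP-style loss under assumed choices (cosine similarity, a fixed temperature of 0.07, equal weighting of both directions). The function names are hypothetical and the actual AGREE encoders and loss details may differ.

```python
import numpy as np

def _log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(audio_emb, geom_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matching pair."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    g = geom_emb / np.linalg.norm(geom_emb, axis=1, keepdims=True)
    logits = a @ g.T / temperature  # (N, N) similarities, matches on the diagonal
    idx = np.arange(len(a))
    loss_a2g = -_log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> geometry
    loss_g2a = -_log_softmax(logits, axis=0)[idx, idx].mean()  # geometry -> audio
    return float(0.5 * (loss_a2g + loss_g2a))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((4, 16))  # stand-in for audio encoder outputs
    geom = rng.standard_normal((4, 16))   # stand-in for geometry encoder outputs
    print("loss on random embeddings:", clip_style_loss(audio, geom))
```

The loss is low when each audio embedding is most similar to its own geometry embedding and high when the pairing is permuted, which is the property that retrieval-style evaluation relies on.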