Learning Control of Neural Sound Effects Synthesis from Physically Inspired Models

Yisu Zong, Joshua Reiss

TL;DR

The paper tackles the challenge of achieving both realism and intuitive control in real-time sound effects synthesis. It proposes a two-stage neural framework guided by a physically inspired model (PM) of explosion sounds, using FiLM conditioning and a latent discriminator to obtain disentangled control. It explores two transfer strategies for aligning synthetic sounds with real-world audio: supervised transfer with pseudo-labels and unsupervised transfer with CycleGAN. Results show that supervised transfer delivers strong control fidelity within the PM parameter range, while unsupervised transfer provides robust audio quality across a broader parameter space. Together, these illustrate a practical way to combine physics priors with neural synthesis for tunable, high-fidelity sound design. The work has potential impact for game audio, film post-production, and real-time sound design, where both controllability and realism are crucial.

Abstract

Sound effects model design commonly uses digital signal processing techniques with full control ability, but it is difficult to achieve realism within a limited number of parameters. Recently, neural sound effects synthesis methods have emerged as a promising approach for generating high-quality and realistic sounds, but the process of synthesizing the desired sound poses difficulties in terms of control. This paper presents a real-time neural synthesis model guided by a physically inspired model, enabling the generation of high-quality sounds while inheriting the control interface of the physically inspired model. We showcase the superior performance of our model in terms of sound quality and control.

Paper Structure

This paper contains 14 sections, 6 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Flow diagram of the proposed methods. Grey boxes represent frozen networks. (a) Representation learning stage for sounds $x_g$ synthesized by the PM, with disentangled control facilitated by the latent discriminator. (b) Supervised transfer from $x_g$ to real-world sounds $x_r$ using pseudo-parameters obtained from a pre-trained parameter estimation model. (c) Unsupervised transfer from $x_g$ to $x_r$ using CycleGAN. (d) Use of the proposed model: control parameters and their corresponding $x_g$ are the model inputs, and the output is $x_r$.
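
The TL;DR names FiLM conditioning as the mechanism that injects control parameters into the network. As a rough illustration of the general technique (feature-wise linear modulation, where each feature channel is scaled and shifted by values predicted from the conditioning input), here is a minimal NumPy sketch. It assumes simple linear projections from the control parameters; the function name `film`, the weight names, and all shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def film(features, params, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation (generic sketch, not the paper's code).

    features: (batch, channels, frames) intermediate activations
    params:   (batch, n_params) control parameters
    Returns features scaled and shifted per channel.
    """
    gamma = params @ w_gamma + b_gamma  # per-channel scale, (batch, channels)
    beta = params @ w_beta + b_beta     # per-channel shift, (batch, channels)
    # Broadcast the per-channel scale and shift over the time axis.
    return gamma[:, :, None] * features + beta[:, :, None]

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
batch, channels, frames, n_params = 2, 4, 8, 3

features = rng.standard_normal((batch, channels, frames))
params = rng.uniform(0.0, 1.0, (batch, n_params))

w_gamma = rng.standard_normal((n_params, channels))
b_gamma = np.ones(channels)
w_beta = rng.standard_normal((n_params, channels))
b_beta = np.zeros(channels)

out = film(features, params, w_gamma, b_gamma, w_beta, b_beta)
print(out.shape)
```

With zero projection weights, unit scale bias, and zero shift bias, the layer reduces to the identity, which is a convenient sanity check when wiring such a layer into a larger model.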