DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks

Xutong Jin; Chenxi Xu; Ruohan Gao; Jiajun Wu; Guoping Wang; Sheng Li

TL;DR

DiffSound is a differentiable sound rendering framework for physics-based modal sound synthesis. It combines an implicit shape representation, a new high-order finite element analysis module, and a differentiable audio synthesizer.

Abstract

Accurately estimating and simulating the physical properties of objects from real-world sound recordings is of great practical importance in the fields of vision, graphics, and robotics. However, progress in these directions has been limited: prior differentiable rigid or soft body simulation techniques cannot be directly applied to modal sound synthesis due to the high sampling rate of audio, while previous audio synthesizers often do not fully model the accurate physical properties of the sounding objects. We propose DiffSound, a differentiable sound rendering framework for physics-based modal sound synthesis, which is based on an implicit shape representation, a new high-order finite element analysis module, and a differentiable audio synthesizer. Our framework can solve a wide range of inverse problems thanks to the differentiability of the entire pipeline, including physical parameter estimation, geometric shape reasoning, and impact position prediction. Experimental results demonstrate the effectiveness of our approach, highlighting its ability to accurately reproduce the target sound in a physics-based manner. DiffSound serves as a valuable tool for various sound synthesis and analysis applications.
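As a rough illustration of what "modal sound synthesis" with an additive synthesizer means, the sketch below sums exponentially decaying sinusoids, one per vibration mode. It is a forward-only toy, not the DiffSound implementation: the function names, the default sample rate, and the use of Rayleigh damping (coefficients `alpha`, `beta`) to turn an eigenvalue into a frequency and decay rate are assumptions drawn from the standard modal analysis formulation, not from the text above.

```python
import math


def mode_from_eigenvalue(lam, alpha=0.0, beta=0.0):
    """Map one eigenvalue of K u = lam * M u to (frequency in Hz, decay rate).

    Assumes the textbook Rayleigh damping model (hypothetical here):
    decay c = (alpha + beta * lam) / 2, damped angular frequency sqrt(lam - c^2).
    """
    decay = 0.5 * (alpha + beta * lam)
    omega_d = math.sqrt(max(lam - decay * decay, 0.0))  # clamp overdamped modes to 0 Hz
    return omega_d / (2.0 * math.pi), decay


def synthesize_modal_sound(freqs_hz, decays, amplitudes,
                           sample_rate=44100, num_samples=441):
    """Additive synthesis: sum over modes of a * exp(-d * t) * sin(2 * pi * f * t)."""
    signal = []
    for k in range(num_samples):
        t = k / sample_rate
        sample = sum(a * math.exp(-d * t) * math.sin(2.0 * math.pi * f * t)
                     for f, d, a in zip(freqs_hz, decays, amplitudes))
        signal.append(sample)
    return signal


# Two illustrative modes; the numbers are made up for the example.
signal = synthesize_modal_sound([440.0, 1210.0], [8.0, 20.0], [1.0, 0.5])
freq_hz, decay = mode_from_eigenvalue((2.0 * math.pi * 440.0) ** 2)
```

In a differentiable setting, the same expression would be written in an autodiff framework so that gradients of a loss on `signal` can flow back to the frequencies, decay rates, and amplitudes.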

Paper Structure (28 sections, 18 equations, 9 figures, 2 tables)

Introduction
Related work
Modal Sound Synthesis
High-Order FEM
Differentiable Simulation
Differentiable Modal Sound Rendering
Method Overview
Differentiable Tetrahedral Representation
Implicit Neural Representation
Implicit to Explicit Representation
Differentiable High-order FEM
Mass and Stiffness Matrix
Eigenvalue Decomposition
Loss Function for Optimization
Differentiable Additive Synthesizer
...and 13 more sections

Figures (9)

Figure 1: Our DiffSound differentiable simulation and inverse rendering pipeline. The differentiable tetrahedral mesh representation is employed to directly optimize the topology of a tetrahedral mesh. Subsequently, a differentiable high-order finite element analysis module is utilized to analyze the vibration frequencies of the tetrahedral mesh. Finally, a differentiable additive synthesizer is used to produce the impact sound, with a hybrid loss function for optimizing all learnable modules. The learnable parameters, indicated by blue boxes, control module outputs in our differentiable framework. This enables gradients of the hybrid loss to be computed with respect to those parameters, facilitating their optimization.
Figure 2: Five configurations of the interface between background tetrahedrons and internal ones. If the internal subregion is more complex than a tetrahedron, it should be subdivided into smaller tetrahedrons.
Figure 3: Ablation study on loss functions. We show the spectrograms, scaling factors of eigenvalues, and RMSE in different setups. Across all setups, our hybrid loss function consistently outperforms the one using only the multi-scale L1 loss or optimal transport-based loss.
Figure 4: Visualization of the surface likelihood distribution (probability heatmap) of the impact position on the object's surface for an example object. The predicted positions are considered reasonable and accurate if they fall within the region that is rotationally symmetric about the central axis relative to the ground truth.
Figure 5: Training process of estimating the damping curve. We utilize 256 initial modes to comprehensively cover all target modes. After training, degraded modes are subsequently removed.
...and 4 more figures
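
The Figure 3 caption contrasts an L1 spectral loss with an optimal transport-based loss. The toy sketch below offers one possible intuition for why the two can complement each other; it is not the paper's loss. It assumes each spectrum is a single peak on a shared, unit-spaced frequency grid, and all function names are hypothetical. The L1 distance between two non-overlapping peaks stays the same however far apart they are, while the 1D Wasserstein-1 distance (computed from the difference of cumulative distributions) grows with the frequency shift.

```python
from itertools import accumulate


def l1_distance(p, q):
    """Bin-wise L1 distance between two spectra on the same grid."""
    return sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on unit-spaced bins.

    For 1D distributions this equals the summed absolute difference
    of their cumulative distribution functions.
    """
    total_p, total_q = sum(p), sum(q)
    cdf_p = accumulate(a / total_p for a in p)
    cdf_q = accumulate(b / total_q for b in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def peak(index, size=16):
    """A toy spectrum with all its energy in one frequency bin."""
    return [1.0 if i == index else 0.0 for i in range(size)]


target = peak(4)
near, far = peak(6), peak(12)

l1_near, l1_far = l1_distance(target, near), l1_distance(target, far)
ot_near, ot_far = wasserstein_1d(target, near), wasserstein_1d(target, far)
```

Here `l1_near` and `l1_far` are equal, so L1 alone gives no signal about which direction to move a badly misplaced peak, whereas `ot_far` is larger than `ot_near`. A hybrid of the two is one way to keep a useful gradient both far from and close to the target.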
