Learning Vocal-Tract Area and Radiation with a Physics-Informed Webster Model

Minhui Lu; Joshua D. Reiss

Abstract

We present a physics-informed voiced backend renderer for singing-voice synthesis. Given synthetic single-channel audio and a fundamental-frequency trajectory, we train a time-domain Webster model as a physics-informed neural network to estimate an interpretable vocal-tract area function and an open-end radiation coefficient. Training enforces partial differential equation and boundary consistency; a lightweight DDSP path is used only to stabilize learning, while inference is purely physics-based. On sustained vowels (/a/, /i/, /u/), parameters rendered by an independent finite-difference time-domain Webster solver reproduce spectral envelopes competitively with a compact DDSP baseline and remain stable under changes in discretization, moderate source variations, and pitch shifts of about ten percent. The in-graph waveform remains breathier than the reference, motivating periodicity-aware objectives and explicit glottal priors in future work.

Paper Structure

This paper contains 15 sections, 7 equations, 2 figures, 4 tables.

Introduction
Physics-informed Voiced Renderer
Governing equations and boundary conditions
Physics losses
Differentiable audio and probes
Auxiliary DDSP renderer (training only)
Overall objective
Training and Evaluation Protocol
Results
Post-render validation: recovered controls transfer beyond the training graph
The periodicity gap: in-graph rendering is systematically more aperiodic
Learned $A(x)$ and $\zeta$: transferable controls but not uniquely identifiable under steady vowels
Robustness to controlled mismatches
Audio examples and qualitative observations
Conclusion

Figures (2)

Figure 1: Overview of the physics-informed voiced renderer. DualNet predicts $(\psi,\hat{A},\hat{\zeta})$ and a differentiable Webster rendering path produces $\hat{y}(t)$ for reference-based losses during training (inference is physics-only). Solid arrows denote forward signal flow in the renderer; dashed arrows denote training-only loss/backprop connections (e.g., using $y(t)$), which are removed at inference. For solver-independent evaluation (not shown), $(\hat{A},\hat{\zeta})$ are exported to an independent FDTD-Webster solver for post-render assessment.
Figure 2: Recovered area functions $\hat{A}(x)$ (normalised units). Here $x$ increases from the glottis $(0)$ to the lips $(1)$. The solutions capture broad vowel-dependent trends (e.g., anterior constriction for /i/ and a narrower mouth end for /u/), while fine-scale details remain ambiguous under single-channel steady voiced supervision.
