Learning Vocal-Tract Area and Radiation with a Physics-Informed Webster Model

Minhui Lu, Joshua D. Reiss

Abstract

We present a physics-informed voiced backend renderer for singing-voice synthesis. Given synthetic single-channel audio and a fundamental-frequency trajectory, we train a time-domain Webster model as a physics-informed neural network to estimate an interpretable vocal-tract area function and an open-end radiation coefficient. Training enforces partial differential equation and boundary consistency; a lightweight DDSP path is used only to stabilize learning, while inference is purely physics-based. On sustained vowels (/a/, /i/, /u/), parameters rendered by an independent finite-difference time-domain Webster solver reproduce spectral envelopes competitively with a compact DDSP baseline and remain stable under changes in discretization, moderate source variations, and pitch shifts of about ten percent. The in-graph waveform remains breathier than the reference, motivating periodicity-aware objectives and explicit glottal priors in future work.

Paper Structure (15 sections, 7 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of the physics-informed voiced renderer. DualNet predicts $(\psi,\hat{A},\hat{\zeta})$ and a differentiable Webster rendering path produces $\hat{y}(t)$ for reference-based losses during training (inference is physics-only). Solid arrows denote forward signal flow in the renderer; dashed arrows denote training-only loss/backprop connections (e.g., using $y(t)$), which are removed at inference. For solver-independent evaluation (not shown), $(\hat{A},\hat{\zeta})$ are exported to an independent FDTD-Webster solver for post-render assessment.
  • Figure 2: Recovered area functions $\hat{A}(x)$ (normalised units). Here $x$ increases from the glottis $(0)$ to the lips $(1)$. The solutions capture broad vowel-dependent trends (e.g., anterior constriction for /i/ and a narrower mouth end for /u/), while fine-scale details remain ambiguous under single-channel steady voiced supervision.
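
To make the role of the area function concrete, the following is a minimal sketch of a lossless finite-difference time-domain scheme for the Webster equation in the common velocity-potential form $\partial_t^2\psi = \frac{c^2}{A}\,\partial_x\!\left(A\,\partial_x\psi\right)$. It is not the authors' solver. Several choices are assumptions made only for illustration: the lip end is treated as an ideal open end ($\psi = 0$) rather than using a radiation coefficient $\hat{\zeta}$, the glottis is treated as closed, and the tube length (0.17 m), sound speed (343 m/s), Courant number, and the function name `first_resonance` are all hypothetical. The sketch excites the tube with an impulse at the glottis and estimates the lowest resonance by scanning a windowed DFT.

```python
import math


def first_resonance(area, length=0.17, c=343.0, courant=0.9,
                    n_steps=4096, f_lo=100.0, f_hi=1000.0, f_step=5.0):
    """Estimate the lowest resonance (Hz) of a lossless Webster tube.

    `area` holds samples of A(x) from glottis (index 0) to lips (last index).
    Assumed boundaries: closed glottis (zero gradient) and an ideal open
    lip end (psi = 0). These are illustrative, not the paper's conditions.
    """
    n = len(area) - 1                      # number of spatial intervals
    dx = length / n
    dt = courant * dx / c                  # time step from the CFL number
    lam2 = courant ** 2
    # Areas at the half grid points between neighbouring samples.
    a_half = [0.5 * (area[i] + area[i + 1]) for i in range(n)]

    psi_prev = [0.0] * (n + 1)
    psi = [0.0] * (n + 1)
    psi[0] = 1.0                           # impulsive excitation at the glottis
    out = []
    for _ in range(n_steps):
        nxt = [0.0] * (n + 1)              # nxt[n] stays 0: ideal open end
        # Closed glottis via a mirrored ghost point.
        nxt[0] = 2.0 * psi[0] - psi_prev[0] + 2.0 * lam2 * (psi[1] - psi[0])
        for i in range(1, n):
            a_i = 0.5 * (a_half[i - 1] + a_half[i])
            flux = (a_half[i] * (psi[i + 1] - psi[i])
                    - a_half[i - 1] * (psi[i] - psi[i - 1]))
            nxt[i] = 2.0 * psi[i] - psi_prev[i] + lam2 * flux / a_i
        out.append(nxt[n - 1])             # observe just inside the lip end
        psi_prev, psi = psi, nxt

    # Hann-windowed DFT magnitude scanned over candidate frequencies.
    sig = [(0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n_steps - 1))) * v
           for k, v in enumerate(out)]
    best_f, best_mag = f_lo, -1.0
    n_freqs = int(round((f_hi - f_lo) / f_step)) + 1
    for j in range(n_freqs):
        f = f_lo + j * f_step
        w = 2.0 * math.pi * f * dt
        re = sum(v * math.cos(w * k) for k, v in enumerate(sig))
        im = sum(v * math.sin(w * k) for k, v in enumerate(sig))
        mag = re * re + im * im
        if mag > best_mag:
            best_f, best_mag = f, mag
    return best_f
```

Under these assumptions, calling `first_resonance([1.0] * 35)` on a uniform tube should return a value close to the quarter-wavelength resonance $c/(4L)$, roughly 504 Hz for the constants above, and narrowing the samples near the lip end should lower the estimate. A solver that uses an exported $\hat{\zeta}$ would replace the ideal open-end condition with a radiation boundary, which this sketch does not attempt.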