Physics-Informed Neural Networks for Speech Production

Kazuya Yokota, Ryosuke Harakawa, Masaaki Baba, Masahiro Iwahashi

TL;DR

The paper tackles the challenge of solving coupled vocal-fold and vocal-tract dynamics from speech signals by introducing physics-informed neural networks that embed the governing equations of the Ishizaka–Flanagan two-mass model and a 1D vocal-tract model. It advances the method with a differentiable approximation for glottal closure, a time-scaling strategy that learns the unknown self-oscillation period $T$, and a hard-constraint coupling that links glottal flow to tract acoustics, enabling both forward vowel synthesis and inverse estimation of glottal states from speech. The method is demonstrated through forward analysis of vowels and inverse analysis of subglottal pressure, achieving close agreement with conventional solvers while reducing complexity and enabling mesh-free computation. The results suggest PINNs offer a versatile, nonlinear, and scalable framework for speech production analysis with potential extensions to higher dimensions and broader phonetic content, including consonants and singing.
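The differentiable approximation for glottal closure mentioned above can be illustrated with a minimal sketch. It assumes the common softplus and sigmoid forms with a sharpness parameter `beta`; the function names, the value of `beta`, and the exact scaling are illustrative assumptions, not the paper's equations.

```python
import math


def smooth_relu(x: float, beta: float = 50.0) -> float:
    """Softplus approximation of max(0, x); a larger beta gives a sharper corner.

    beta is a hypothetical sharpness parameter chosen for illustration.
    """
    z = beta * x
    # Numerically stable form: softplus(z) = max(z, 0) + log1p(exp(-|z|))
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def smooth_step(x: float, beta: float = 50.0) -> float:
    """Sigmoid approximation of the unit step function."""
    z = beta * x
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


if __name__ == "__main__":
    # Compare the smooth functions with their hard counterparts near zero.
    for x in (-0.1, -0.01, 0.0, 0.01, 0.1):
        print(f"x={x:+.2f}  relu~{smooth_relu(x):.5f}  step~{smooth_step(x):.5f}")
```

With these forms the derivative of `smooth_relu` is exactly `smooth_step`, so the gradient never vanishes identically while the folds are closed. This is the property that makes such approximations attractive for gradient-based training.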

Abstract

The analysis of speech production based on physical models of the vocal folds and vocal tract is essential for studies on vocal-fold behavior and linguistic research. This paper proposes a speech production analysis method using physics-informed neural networks (PINNs). The networks are trained directly on the governing equations of vocal-fold vibration and vocal-tract acoustics. Vocal-fold collisions introduce nondifferentiability and vanishing gradients, challenging phenomena for PINNs. We demonstrate, however, that introducing a differentiable approximation function enables the analysis of vocal-fold vibrations within the PINN framework. The period of self-excited vocal-fold vibration is generally unknown. We show that by treating the period as a learnable network parameter, a periodic solution can be obtained. Furthermore, by implementing the coupling between glottal flow and vocal-tract acoustics as a hard constraint, glottis-tract interaction is achieved without additional loss terms. We confirmed the method's validity through forward and inverse analyses, demonstrating that the glottal flow rate, vocal-fold vibratory state, and subglottal pressure can be simultaneously estimated from speech signals. Notably, the same network architecture can be applied to both forward and inverse analyses, highlighting the versatility of this approach. The proposed method inherits the advantages of PINNs, including mesh-free computation and the natural incorporation of nonlinearities, and thus holds promise for a wide range of applications.
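To make the idea of a hard constraint concrete, the sketch below uses a generic construction that is common in the PINN literature: the raw network output is multiplied by a factor that vanishes at the boundary and is added to the prescribed boundary value. The paper's exact formulation is not given in this summary and may differ. `raw_net` and `glottal_flow` are hypothetical stand-ins for the vocal-tract network and the glottal flow supplied by the vocal-fold model.

```python
import math
from typing import Callable


def constrain_inlet(
    raw: Callable[[float, float], float],
    inlet: Callable[[float], float],
) -> Callable[[float, float], float]:
    """Return u(x, t) = inlet(t) + x * raw(x, t).

    At x = 0 the second term vanishes, so u(0, t) equals inlet(t) exactly
    and no boundary loss term is needed.
    """
    def u(x: float, t: float) -> float:
        return inlet(t) + x * raw(x, t)
    return u


def raw_net(x: float, t: float) -> float:
    """Hypothetical stand-in for an unconstrained network output."""
    return math.tanh(0.7 * x - 1.3 * t + 0.2)


def glottal_flow(t: float) -> float:
    """Hypothetical smooth periodic inlet flow (100 Hz, arbitrary units)."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * 100.0 * t))


volume_velocity = constrain_inlet(raw_net, glottal_flow)

if __name__ == "__main__":
    for t in (0.0, 0.0025, 0.005):
        print(t, volume_velocity(0.0, t), glottal_flow(t))
```

Because the boundary value is satisfied by construction, the optimizer only has to reduce the residuals of the governing equations, which is consistent with the abstract's statement that no additional loss terms are required.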

Paper Structure

This paper contains 19 sections, 37 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Vocal-fold and vocal-tract models used in this study. The vocal folds are represented by the Ishizaka–Flanagan two-mass model [Ishizaka], and the vocal tract is represented by a one-dimensional acoustic tube model [VT_model].
  • Figure 2: Proposed PINN architecture for speech production. The upper network predicts the vocal-fold displacements, while the lower network predicts the sound pressure and volume velocity in the vocal tract. Coupled analysis is achieved by exchanging the pressure and volume velocity at $x=0$ between the two networks during the loss function calculation.
  • Figure 3: Function approximation using differentiable functions. (a) Softplus approximation of the glottal area. (b) Sigmoid approximation of the step function.
  • Figure 4: Vocal-tract cross-sectional area functions. In this study, the shapes of /a/ and /u/ reported by Arai [linguistic1] were interpolated using the PCHIP method [PCHIP1, PCHIP2].
  • Figure 5: Epoch-wise variation of the relative error of the period $T$ estimated by the proposed method with respect to the reference value. Starting from an initial error of 20%, the estimated period converges to the true value after approximately 2,000 epochs.
  • ...and 5 more figures
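The period learning summarized in Figure 5 can be illustrated with a toy problem. The sketch below is not the paper's setup: it replaces the two-mass model with a harmonic oscillator $\ddot{x} + \omega^2 x = 0$, replaces the network with a fixed trial solution of unit period in scaled time $\tau = t/T$, and uses a hand-derived gradient instead of automatic differentiation. The function name `learn_period`, the learning rate, and the log-parameterization of $T$ are assumptions made for illustration.

```python
import math


def learn_period(omega: float, t_init: float, lr: float = 0.1, steps: int = 200) -> float:
    """Estimate the period T of x'' + omega^2 x = 0 by gradient descent.

    In scaled time tau = t / T the equation reads (1/T^2) x_tautau + omega^2 x = 0.
    The trial solution x(tau) = cos(2 pi tau) has period 1 in tau, so only T
    has to be learned.
    """
    n = 64
    taus = [i / n for i in range(n)]
    x = [math.cos(2.0 * math.pi * tau) for tau in taus]
    x_tautau = [-(2.0 * math.pi) ** 2 * xi for xi in x]  # analytic second derivative

    s = 0.0  # learnable log-scale; T = t_init * exp(s) stays positive
    for _ in range(steps):
        period = t_init * math.exp(s)
        c = (omega * period / (2.0 * math.pi)) ** 2
        # Residual multiplied by T^2 / (2 pi)^2 to keep it of order one.
        res = [xtt / (2.0 * math.pi) ** 2 + c * xi for xtt, xi in zip(x_tautau, x)]
        # d(res_i)/ds = 2 * c * x_i, so d(mean(res^2))/ds = mean(2 * res_i * 2 * c * x_i)
        grad = sum(2.0 * r * 2.0 * c * xi for r, xi in zip(res, x)) / n
        s -= lr * grad
    return t_init * math.exp(s)


if __name__ == "__main__":
    t_true = 0.01                      # 100 Hz oscillation
    omega = 2.0 * math.pi / t_true
    t_est = learn_period(omega, t_init=1.2 * t_true)  # start with a 20% error
    print(f"estimated T = {t_est:.6f} s, relative error = {abs(t_est / t_true - 1):.2e}")
```

The initial guess is set 20% away from the true period to mirror the starting error reported in Figure 5. The number of steps needed here says nothing about the roughly 2,000 epochs reported for the full model.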