Physics-Informed Neural Networks for Speech Production

Kazuya Yokota; Ryosuke Harakawa; Masaaki Baba; Masahiro Iwahashi

TL;DR

The paper addresses the coupled dynamics of the vocal folds and vocal tract using physics-informed neural networks (PINNs) that embed the governing equations of the Ishizaka–Flanagan two-mass model and a 1D vocal-tract model. It makes three contributions: a differentiable approximation for glottal closure, a time-scaling strategy that learns the unknown self-oscillation period $T$, and a hard-constraint coupling that links glottal flow to tract acoustics. Together these enable both forward vowel synthesis and inverse estimation of glottal states from speech. The method is demonstrated through forward analysis of vowels and inverse analysis that estimates subglottal pressure, achieving close agreement with conventional solvers while reducing complexity and enabling mesh-free computation. The results suggest that PINNs offer a versatile, scalable framework that handles nonlinearity naturally, with potential extensions to higher dimensions and broader phonetic content, including consonants and singing.

Abstract

The analysis of speech production based on physical models of the vocal folds and vocal tract is essential for studies on vocal-fold behavior and linguistic research. This paper proposes a speech production analysis method using physics-informed neural networks (PINNs). The networks are trained directly on the governing equations of vocal-fold vibration and vocal-tract acoustics. Vocal-fold collisions introduce nondifferentiability and vanishing gradients, challenging phenomena for PINNs. We demonstrate, however, that introducing a differentiable approximation function enables the analysis of vocal-fold vibrations within the PINN framework. The period of self-excited vocal-fold vibration is generally unknown. We show that by treating the period as a learnable network parameter, a periodic solution can be obtained. Furthermore, by implementing the coupling between glottal flow and vocal-tract acoustics as a hard constraint, glottis-tract interaction is achieved without additional loss terms. We confirmed the method's validity through forward and inverse analyses, demonstrating that the glottal flow rate, vocal-fold vibratory state, and subglottal pressure can be simultaneously estimated from speech signals. Notably, the same network architecture can be applied to both forward and inverse analyses, highlighting the versatility of this approach. The proposed method inherits the advantages of PINNs, including mesh-free computation and the natural incorporation of nonlinearities, and thus holds promise for a wide range of applications.
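Two of the ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not state which approximation function the paper uses, so softplus is assumed here as a stand-in for a differentiable replacement of max(0, x), and the sharpness parameter `beta` is hypothetical. The time-scaling helpers likewise assume a normalized time tau = t / T, so that the unknown period T appears as an ordinary scalar that an optimizer could update; all function names are illustrative.

```python
import math

def smooth_relu(x, beta=50.0):
    """Softplus, an assumed differentiable stand-in for max(0, x).

    Larger beta (hypothetical sharpness parameter) tracks max(0, x) more closely.
    Uses the numerically stable form max(z, 0) + log1p(exp(-|z|)).
    """
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta

def smooth_relu_grad(x, beta=50.0):
    """Derivative of softplus (the logistic sigmoid).

    Unlike the derivative of max(0, x), it is strictly positive in exact
    arithmetic, so a gradient signal survives on the "closed" side (x < 0).
    """
    z = beta * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def to_physical_time(tau, period):
    """Map normalized time tau in [0, 1] to physical time t = period * tau."""
    return period * tau

def d_dt_from_d_dtau(d_dtau, period):
    """Chain rule for the assumed scaling: d/dt = (1 / period) * d/dtau."""
    return d_dtau / period

if __name__ == "__main__":
    # Opening-like quantity that would be clamped at zero on closure.
    for x in (-0.1, 0.0, 0.1):
        print(x, smooth_relu(x), smooth_relu_grad(x))
    # One cycle of sin(2*pi*tau): slope at tau = 0 converted to physical time.
    print(d_dt_from_d_dtau(2 * math.pi, period=0.01))
```

Under these assumptions, a network evaluated on tau would be periodic by construction over one cycle, and the residuals of the governing equations would pick up factors of 1/T through the chain rule, which is what lets T be adjusted during training.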
