Dysarthria Normalization via Local Lie Group Transformations for Robust ASR

Mikhail Osipov

TL;DR

This work treats dysarthric speech as a structured, local deformation of healthy speech in the time–frequency domain by modeling distortions as Lie-group actions on spectrograms. A U-Net predicts spatially smooth transformation fields for time, frequency, and amplitude, which are then inverted to normalize the input prior to ASR, with training driven by synthetic distortions and a spontaneous-symmetry-breaking (SSB) loss to avoid trivial solutions. The approach yields zero-shot improvements on real dysarthric datasets (TORGO and UA-Speech), reducing WER by up to 17 percentage points and lowering WER variance, while preserving performance on clean speech and improving phoneme/character error rates. The method demonstrates interpretable, geometry-aware front-ends that generalize across ASR backends and offers a flexible, extensible framework for incorporating physical or articulatory priors in robust speech recognition.

Abstract

We present a geometry-driven method for normalizing dysarthric speech by modeling time, frequency, and amplitude distortions as smooth, local Lie group transformations of spectrograms. Scalar fields generate these deformations via exponential maps, and a neural network is trained, using only synthetically warped healthy speech, to infer the fields and apply an approximate inverse at test time. We introduce a spontaneous-symmetry-breaking (SSB) potential that encourages the model to discover non-trivial field configurations. On real pathological speech, the system delivers consistent gains: up to a 17 percentage-point WER reduction on challenging TORGO utterances and a 16 percent drop in WER variance, with no degradation on clean CommonVoice data. Character and phoneme error rates improve in parallel, confirming linguistic relevance. Our results demonstrate that geometrically structured warping provides consistent, zero-shot robustness gains for dysarthric ASR.
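To make the warp-and-invert idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that the time and frequency fields act as per-bin displacements (in frames and bins), that the amplitude field acts as a multiplicative gain $\exp(\phi_{\text{amp}})$, and that the approximate inverse is obtained by applying the same warp with negated fields. The function names `bilinear_sample`, `warp`, and `approx_inverse`, as well as the toy spectrogram and field shapes, are hypothetical and chosen only for illustration.

```python
import numpy as np

def bilinear_sample(spec, f_coords, t_coords):
    """Sample spec (freq x time) at fractional coordinates, clamping at the edges."""
    n_f, n_t = spec.shape
    f = np.clip(f_coords, 0, n_f - 1)
    t = np.clip(t_coords, 0, n_t - 1)
    f0 = np.floor(f).astype(int)
    t0 = np.floor(t).astype(int)
    f1 = np.minimum(f0 + 1, n_f - 1)
    t1 = np.minimum(t0 + 1, n_t - 1)
    wf, wt = f - f0, t - t0
    low = (1 - wt) * spec[f0, t0] + wt * spec[f0, t1]
    high = (1 - wt) * spec[f1, t0] + wt * spec[f1, t1]
    return (1 - wf) * low + wf * high

def warp(spec, phi_time, phi_freq, phi_amp):
    """Assumed action: shift sampling positions by the fields, then scale by exp(phi_amp)."""
    n_f, n_t = spec.shape
    f_grid, t_grid = np.meshgrid(np.arange(n_f), np.arange(n_t), indexing="ij")
    shifted = bilinear_sample(spec, f_grid + phi_freq, t_grid + phi_time)
    return np.exp(phi_amp) * shifted

def approx_inverse(spec, phi_time, phi_freq, phi_amp):
    """First-order inverse: the same warp with negated fields (exact only for constant fields)."""
    return warp(spec, -phi_time, -phi_freq, -phi_amp)

# Toy example: a smooth synthetic "spectrogram" and smooth, small fields.
n_f, n_t = 64, 128
f_grid, t_grid = np.meshgrid(np.arange(n_f), np.arange(n_t), indexing="ij")
spec = 1.0 + 0.5 * np.sin(2 * np.pi * f_grid / 32) * np.cos(2 * np.pi * t_grid / 64)
phi_time = 1.5 * np.sin(2 * np.pi * t_grid / n_t)   # displacement in frames
phi_freq = 1.0 * np.sin(2 * np.pi * f_grid / n_f)   # displacement in bins
phi_amp = 0.1 * np.sin(2 * np.pi * t_grid / n_t)    # log-gain

distorted = warp(spec, phi_time, phi_freq, phi_amp)
restored = approx_inverse(distorted, phi_time, phi_freq, phi_amp)

distorted_err = np.mean(np.abs(distorted - spec))
restored_err = np.mean(np.abs(restored - spec))
print(f"mean abs error before normalization: {distorted_err:.4f}")
print(f"mean abs error after normalization:  {restored_err:.4f}")
```

In this sketch the fields are known exactly; in the method described above, a neural network would have to infer them from the distorted spectrogram. Negating the fields inverts the warp only to first order, so the residual error grows with the magnitude and gradient of the fields.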

Paper Structure

This paper contains 50 sections, 15 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Examples of generated fields: $\phi_{\text{time}}$, $\phi_{\text{freq}}$ (before masking), $\phi_{\text{amp}}$ (masked)
  • Figure 2: Training and validation loss dynamics (10,000 samples from the CommonVoice dataset; model version v.2, batch size 16, initial learning rate $3 \times 10^{-5}$). The $\varepsilon$ parameter grows linearly during the warmup stage, plateaus, and then grows linearly until the end of training
  • Figure 3: Scaled loss function terms and $\varepsilon$ dynamics across training steps. Model version v.2
  • Figure 4: Training and validation loss dynamics (10,000 samples from the CommonVoice dataset; model version v.1, batch size 32, initial learning rate $3 \times 10^{-5}$). The $\varepsilon$ parameter grows linearly during the warmup stage, plateaus, and then grows linearly until the end of training
  • Figure 5: Weighted loss function terms and $\varepsilon$ dynamics across training steps. Model version v.1
  • ...and 1 more figure
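
The captions of Figures 2 and 4 describe an $\varepsilon$ schedule with three phases: linear warmup, plateau, then linear growth until the end of training. A minimal sketch of such a piecewise-linear schedule is given below. The function name `epsilon_schedule`, its parameters, and all numeric values are hypothetical, since the phase lengths and $\varepsilon$ levels are not stated here.

```python
def epsilon_schedule(step, total_steps, warmup_steps, plateau_steps,
                     eps_plateau, eps_final):
    """Piecewise-linear schedule: warmup ramp, flat plateau, final linear ramp.

    All arguments are illustrative assumptions, not values from the paper.
    """
    if step < warmup_steps:
        # Phase 1: grow linearly from 0 to eps_plateau.
        return eps_plateau * step / warmup_steps
    ramp_start = warmup_steps + plateau_steps
    if step < ramp_start:
        # Phase 2: hold constant.
        return eps_plateau
    # Phase 3: grow linearly from eps_plateau to eps_final at total_steps.
    ramp_steps = max(total_steps - ramp_start, 1)
    progress = min((step - ramp_start) / ramp_steps, 1.0)
    return eps_plateau + (eps_final - eps_plateau) * progress

# Example with made-up settings: 1000 steps, 100 warmup, 200 plateau.
schedule = [
    epsilon_schedule(s, total_steps=1000, warmup_steps=100, plateau_steps=200,
                     eps_plateau=0.1, eps_final=1.0)
    for s in range(1001)
]
```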