Dysarthria Normalization via Local Lie Group Transformations for Robust ASR
Mikhail Osipov
TL;DR
This work treats dysarthric speech as a structured, local deformation of healthy speech in the time–frequency domain, modeling the distortions as Lie-group actions on spectrograms. A U-Net predicts spatially smooth transformation fields for time, frequency, and amplitude, which are then inverted to normalize the input before automatic speech recognition (ASR). Training relies on synthetic distortions and a spontaneous-symmetry-breaking (SSB) loss that steers the model away from trivial solutions. The approach yields zero-shot improvements on real dysarthric datasets (TORGO and UA-Speech), reducing word error rate (WER) by up to 17 percentage points and lowering WER variance, while preserving performance on clean speech and improving phoneme and character error rates. The result is an interpretable, geometry-aware front-end that generalizes across ASR backends and offers a flexible, extensible framework for incorporating physical or articulatory priors into robust speech recognition.
Abstract
We present a geometry-driven method for normalizing dysarthric speech by modeling time, frequency, and amplitude distortions as smooth, local Lie group transformations of spectrograms. Scalar fields generate these deformations via exponential maps, and a neural network is trained, using only synthetically warped healthy speech, to infer the fields and apply an approximate inverse at test time. We introduce a spontaneous-symmetry-breaking (SSB) potential that encourages the model to discover non-trivial field configurations. On real pathological speech, the system delivers consistent gains: up to a 17 percentage-point reduction in word error rate (WER) on challenging TORGO utterances and a 16 percent drop in WER variance, with no degradation on clean CommonVoice data. Character and phoneme error rates improve in parallel, confirming linguistic relevance. Our results demonstrate that geometrically structured warping provides consistent, zero-shot robustness gains for dysarthric ASR.
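To make the warping idea above concrete, the following is a minimal NumPy sketch of a local time, frequency, and amplitude deformation of a spectrogram, together with a first-order approximate inverse. It is not the implementation used in this work. The function names (`bilinear_sample`, `warp`, `approx_inverse`), the use of per-bin displacement fields as a stand-in for the exponential map, the bilinear interpolation, and the negated-field inverse are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(spec, t_idx, f_idx):
    """Sample a (freq x time) array at fractional coordinates, clamping at the edges."""
    n_f, n_t = spec.shape
    t = np.clip(t_idx, 0, n_t - 1)
    f = np.clip(f_idx, 0, n_f - 1)
    t0 = np.floor(t).astype(int)
    f0 = np.floor(f).astype(int)
    t1 = np.minimum(t0 + 1, n_t - 1)
    f1 = np.minimum(f0 + 1, n_f - 1)
    wt = t - t0  # fractional offset along time
    wf = f - f0  # fractional offset along frequency
    low = (1 - wt) * spec[f0, t0] + wt * spec[f0, t1]
    high = (1 - wt) * spec[f1, t0] + wt * spec[f1, t1]
    return (1 - wf) * low + wf * high


def warp(spec, u_t, u_f, a):
    """Apply a local deformation: out(f, t) = exp(a) * spec(f + u_f, t + u_t).

    u_t, u_f, a are scalar fields (hypothetical names) with the same shape as
    spec, or plain scalars. Displacements are measured in bins. The amplitude
    field acts multiplicatively through exp(a), so a = 0 leaves the energy unchanged.
    """
    n_f, n_t = spec.shape
    f_grid, t_grid = np.meshgrid(np.arange(n_f), np.arange(n_t), indexing="ij")
    return np.exp(a) * bilinear_sample(spec, t_grid + u_t, f_grid + u_f)


def approx_inverse(spec, u_t, u_f, a):
    """First-order inverse (an assumption): apply the same warp with negated fields.

    This is exact for constant fields away from the edges and only approximate
    for smoothly varying ones.
    """
    return warp(spec, -u_t, -u_f, -a)


if __name__ == "__main__":
    # Smooth synthetic "spectrogram" and smooth, small fields.
    f_grid, t_grid = np.meshgrid(np.arange(16), np.arange(64), indexing="ij")
    clean = 1.0 + 0.5 * np.sin(2 * np.pi * t_grid / 32) * np.cos(2 * np.pi * f_grid / 16)
    u_t = 1.5 * np.sin(2 * np.pi * t_grid / 64)
    u_f = np.zeros_like(u_t)
    a = 0.2 * np.cos(2 * np.pi * t_grid / 64)

    distorted = warp(clean, u_t, u_f, a)
    restored = approx_inverse(distorted, u_t, u_f, a)
    print("mean abs error, distorted:", np.abs(distorted - clean).mean())
    print("mean abs error, restored: ", np.abs(restored - clean).mean())
```

In this sketch the fields are known because they were used to create the distortion. In the setting described above, a network would instead have to infer them from the distorted spectrogram alone, which is the harder part of the problem and is not shown here.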
