Developing Acoustic Models for Automatic Speech Recognition in Swedish
Giampiero Salvi
TL;DR
This work addresses robust, speaker-independent automatic speech recognition for Swedish over telephone lines by building HMM-based acoustic models trained on the Swedish SpeechDat corpus. It systematically evaluates monophone and triphone architectures with within-word and cross-word context expansion, incorporates noise and boundary models, and explores retroflex allophones. The best result is 88.6% accuracy, obtained with within-word context-expanded triphones using eight-component Gaussian mixtures; the amount of training data and the characteristics of the task limit the gains from cross-word context and retroflex variants. The study also demonstrates the flexibility of the models on related datasets (Waxholm and Norwegian SpeechDat) and outlines concrete avenues for improving performance: more data, dialect-aware modeling, and noise mitigation. Together these results highlight the practical potential of Swedish ASR in varied applications.
Abstract
This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. The models are hidden Markov models whose parameters are trained on the SpeechDat database. Acoustic modeling is carried out at the phonetic level, which allows general speech recognition applications, although a simplified task (recognition of digits and natural numbers) is used for model evaluation. Different kinds of phone models have been tested, including context-independent models and two variations of context-dependent models. In addition, many experiments have been run with bigram language models to tune some of the system parameters. System performance has also been examined over speaker subsets that differ in sex, age, and dialect. The results are compared with those of previous similar studies and show a remarkable improvement.
