Developing Acoustic Models for Automatic Speech Recognition in Swedish

Giampiero Salvi

TL;DR

This work addresses robust, speaker-independent automatic speech recognition for Swedish over telephone lines by building HMM-based acoustic models trained on the SpeechDat Swedish corpus. It systematically evaluates monophone and triphone architectures, with within-word and cross-word context expansions, incorporating noise and boundary models and exploring retroflex allophones. The best result reaches 88.6% accuracy using within-word context-expanded triphones with eight Gaussian mixtures, while showing that data size and task characteristics limit gains from cross-word context or retroflex variants. The study demonstrates model flexibility across related datasets (Waxholm and Norwegian SpeechDat) and outlines concrete avenues for improving performance via more data, dialect-aware modeling, and noise mitigation, highlighting practical potential for Swedish ASR in varied applications.

Abstract

This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done by employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been carried out at the phonetic level, which allows general speech recognition applications, even though a simplified task (digit and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context-independent models and two variations of context-dependent models. Furthermore, many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets differing in sex, age and dialect has also been examined. Results are compared to previous similar studies and show a remarkable improvement.
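The evaluation figures quoted above are accuracies on a digit and natural number task. This section does not define the metric, so the following is only a minimal sketch, assuming the common word-level definition accuracy = (N - S - D - I) / N, where N is the number of reference words and S, D, I are the substitutions, deletions and insertions in a minimum edit-distance alignment. The function names `edit_distance` and `word_accuracy` and the example word sequences are illustrative and do not come from the paper.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(ref, hyp):
    """Assumed definition: (N - S - D - I) / N, with S + D + I taken
    from the minimum edit distance. Can be negative with many insertions."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical digit strings: one substitution and one insertion.
reference = ["ett", "sex", "tre", "fyra"]
hypothesis = ["ett", "tre", "tre", "fyra", "fem"]
print(word_accuracy(reference, hypothesis))
```

Because insertions are penalised, this measure is stricter than the plain fraction of correctly recognised words, which is one reason the two figures should not be compared directly across studies.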

Paper Structure (24 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Model topology for different applications
  • Figure 2: Development test (50 speakers). Monophones, 1, 2, 4, 8 Gaussian terms, old lexicon (blue line) and new lexicon (red line)
  • Figure 3: a: Tree clustering threshold optimization (Accuracy/# of states): within word context expanded models, old lexicon (blue line), new lexicon (red line). b: Tree clustering threshold optimization (Accuracy/# of states): cross word context expanded models, old lexicon (blue line), new lexicon (red line)
  • Figure 4: a: Accuracy/iterations 1, 2, 4, 8 Gaussian terms: within word context expansion, old lexicon (blue line) and new lexicon (red line). b: Accuracy/iterations 1, 2, 4, 8 Gaussian terms: cross word context expansion, old lexicon (blue line) and new lexicon (red line)
  • Figure 5: a: Number of speakers in 10% Accuracy ranges. b: Number of speakers in 10% Accuracy ranges (log)
  • ...and 2 more figures