Multilingual acoustic word embeddings for zero-resource languages

Christiaan Jacobs

TL;DR

This work tackles the challenge of building speech technologies for zero-resource languages by advancing acoustic word embeddings (AWEs) and exploring multilingual transfer. It introduces a novel ContrastiveRNN model and investigates unsupervised adaptation of multilingual AWE models to target languages, showing consistent gains in intrinsic word-discrimination and downstream tasks. The thesis also examines how training languages, especially related languages, influence performance and demonstrates a practical ASR-free hate-speech keyword spotting system in Swahili, highlighting real-world robustness. Furthermore, it proposes semantic AWEs by leveraging multilingual knowledge, achieving state-of-the-art semantic similarity and enabling semantic query-by-example retrieval. Overall, the work demonstrates the versatility of multilingual AWEs for rapid deployment of speech applications in low-resource settings and highlights clear avenues for future research into segmentation, semantics, and broader language coverage.

Abstract

This research addresses the challenge of developing speech applications for zero-resource languages that lack labelled data. It uses acoustic word embeddings (AWEs), fixed-dimensional representations of variable-duration speech segments, and employs multilingual transfer, where labelled data from several well-resourced languages are used for pretraining. The study introduces a new neural network that outperforms existing AWE models on zero-resource languages. It explores the impact of the choice of well-resourced languages. AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts, demonstrating robustness in real-world scenarios. Additionally, novel semantic AWE models improve semantic query-by-example search.
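
To make the idea of a fixed-dimensional representation of a variable-duration segment concrete, here is a minimal sketch. It is not the model introduced in this work: it uses simple uniform downsampling of frames, a basic non-neural way to obtain a fixed-size vector. The function names, the 13-dimensional toy features, and the choice of 10 kept frames are all assumptions made for illustration.

```python
import math


def downsample_embed(frames, n_keep=10):
    """Map a variable-length list of feature frames to one fixed-length vector.

    Illustrative baseline only (not the thesis model): pick n_keep frames at
    evenly spaced positions and concatenate them.
    """
    if not frames:
        raise ValueError("segment has no frames")
    if n_keep < 2:
        raise ValueError("n_keep must be at least 2")
    last = len(frames) - 1
    picks = [round(i * last / (n_keep - 1)) for i in range(n_keep)]
    return [value for idx in picks for value in frames[idx]]


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def toy_segment(n_frames, dim=13, phase=0.0):
    """Hypothetical stand-in for real speech features (e.g. one frame per 10 ms)."""
    return [
        [math.sin(phase + 0.3 * d + 2.0 * t / n_frames) for d in range(dim)]
        for t in range(n_frames)
    ]


# Two renditions of the "same word" at different speaking rates, plus a different word.
short_seg = toy_segment(40)
long_seg = toy_segment(95)
other_seg = toy_segment(60, phase=1.5)

e_short = downsample_embed(short_seg)
e_long = downsample_embed(long_seg)
e_other = downsample_embed(other_seg)

d_same = cosine_distance(e_short, e_long)
d_other = cosine_distance(e_short, e_other)
print(len(e_short), len(e_long), len(e_other))
print(d_same < d_other)
```

All three segments have different durations but yield vectors of the same length, so they can be compared directly with a distance function. This direct comparison of speech segments is the property that query-by-example search and keyword spotting rely on.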

Paper Structure (118 sections, 22 equations, 36 figures, 18 tables)

Figures (36)

  • Figure 1: Illustrating the difference between (a) TWEs and (b) AWEs. In a TWE space, words sharing contextual meaning end up close to each other (different shades of the same colour). For AWEs, each spoken instance maps to a unique embedding (multiple of the same colour) with acoustically similar segments positioned close to each other.
  • Figure 2: Semantic AWEs capture acoustic similarity among instances of the same word type (represented by the same colour) while also preserving word meaning (reflected in the shade of the same colour).
  • Figure 3: The structure of an RNN with one recurrent layer.
  • Figure 4: The AE and CAE structure.
  • Figure 5: The Siamese network structure.
  • ...and 31 more figures