Multilingual acoustic word embeddings for zero-resource languages

Christiaan Jacobs

TL;DR

This work tackles the challenge of building speech technologies for zero-resource languages by advancing acoustic word embeddings (AWEs) and exploring multilingual transfer. It introduces a novel ContrastiveRNN model and investigates unsupervised adaptation of multilingual AWE models to target languages, showing consistent gains in intrinsic word-discrimination and downstream tasks. The thesis also examines how training languages, especially related languages, influence performance and demonstrates a practical ASR-free hate-speech keyword spotting system in Swahili, highlighting real-world robustness. Furthermore, it proposes semantic AWEs by leveraging multilingual knowledge, achieving state-of-the-art semantic similarity and enabling semantic query-by-example retrieval. Overall, the work demonstrates the versatility of multilingual AWEs for rapid deployment of speech applications in low-resource settings and highlights clear avenues for future research into segmentation, semantics, and broader language coverage.

Abstract

This research addresses the challenge of developing speech applications for zero-resource languages that lack labelled data. It focuses on acoustic word embeddings (AWEs), fixed-dimensional representations of variable-duration speech segments, and employs multilingual transfer, where labelled data from several well-resourced languages are used for pretraining. The study introduces a new neural network that outperforms existing AWE models on zero-resource languages, and it explores how the choice of well-resourced training languages affects performance. AWEs are applied in a keyword-spotting system for hate speech detection in Swahili radio broadcasts, demonstrating robustness in real-world scenarios. Additionally, novel semantic AWE models improve semantic query-by-example search.
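
To make the idea of a fixed-dimensional representation of a variable-duration segment concrete, here is a minimal sketch in Python. It is not the neural model introduced in this work: it assumes a simple frame-downsampling embedding, and the function names, the 13-dimensional features, and the random input data are all hypothetical placeholders chosen for illustration.

```python
import numpy as np


def downsample_embed(features: np.ndarray, n_samples: int = 10) -> np.ndarray:
    """Map a (T, D) feature sequence to a fixed (n_samples * D,) vector.

    Illustrative assumption only: pick n_samples frames at evenly spaced
    positions and concatenate them. A learned AWE model would replace this.
    """
    n_frames = features.shape[0]
    # Evenly spaced frame indices from the first to the last frame.
    idx = np.linspace(0, n_frames - 1, n_samples).round().astype(int)
    return features[idx].reshape(-1)


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between two embeddings (0 means same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical speech segments: random stand-ins for acoustic features,
# with different durations (43 and 97 frames) but the same feature size (13).
rng = np.random.default_rng(0)
seg_short = rng.standard_normal((43, 13))
seg_long = rng.standard_normal((97, 13))

emb_short = downsample_embed(seg_short)
emb_long = downsample_embed(seg_long)

# Both embeddings have the same dimensionality, so they can be compared directly.
print(emb_short.shape, emb_long.shape)
print(cosine_distance(emb_short, emb_long))
```

Because both segments end up in the same vector space, tasks such as word discrimination or query-by-example search reduce to distance comparisons between embeddings rather than alignment of variable-length sequences.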

Paper Structure (118 sections, 22 equations, 36 figures, 18 tables)

Introduction
Motivation
Goals and methodology
Acoustic word embeddings
Acoustic word embeddings in a zero-resource setting
Downstream application
Semantic acoustic word embeddings
Contributions
Publications
Thesis overview
Background
Neural networks
Recurrent neural network
Autoencoder
Correspondence autoencoder
...and 103 more sections

Figures (36)

Figure 1: Illustrating the difference between (a) TWEs and (b) AWEs. In a TWE space, words sharing contextual meaning end up close to each other (different shades of the same colour). For AWEs, each spoken instance maps to a unique embedding (multiple of the same colour) with acoustically similar segments positioned close to each other.
Figure 2: Semantic AWEs capture acoustic similarity among instances of the same word type (represented by the same colour) while also preserving word meaning (reflected in the shade of the same colour).
Figure 3: The structure of an RNN with one recurrent layer.
Figure 4: The AE and CAE structure.
Figure 5: The Siamese network structure.
...and 31 more figures
