Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

Amit Meghanani; Thomas Hain

TL;DR

The HuBERT-based CAE model achieves the best word discrimination results in all five languages, even though HuBERT is pre-trained on English only, and it also works well in cross-lingual settings.

Abstract

Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been used with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT and Wav2vec2 outperform MFCCs in many downstream tasks. However, they have not been well studied in the context of learning AWEs. This work explores the effectiveness of CAE with SSL-based speech representations to obtain improved AWEs. Additionally, the capabilities of SSL-based speech models are explored in cross-lingual scenarios for obtaining AWEs. Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English. The HuBERT-based CAE model achieves the best results for word discrimination in all languages, despite HuBERT being pre-trained on English only. The HuBERT-based CAE model also works well in cross-lingual settings: when trained on one source language and tested on the target languages, it outperforms MFCC-based CAE models trained on the target languages themselves.
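To make the word discrimination evaluation mentioned in the abstract more concrete, the following is a minimal sketch of one common way such a task is scored: every pair of embeddings is ranked by cosine distance, and average precision measures how well same-word pairs are ranked ahead of different-word pairs. The choice of cosine distance and average precision is an assumption here, not a detail stated in the abstract, and the function names, toy vectors, and labels are all hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def same_different_ap(embeddings, labels):
    """Average precision over all embedding pairs, ranked by distance.

    Assumed scoring scheme: a pair counts as a hit when both
    embeddings carry the same word label.
    """
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        dist = cosine_distance(embeddings[i], embeddings[j])
        pairs.append((dist, labels[i] == labels[j]))
    pairs.sort(key=lambda p: p[0])  # closest pairs first

    hits = 0
    precision_sum = 0.0
    for rank, (_, is_same) in enumerate(pairs, start=1):
        if is_same:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy 2-dimensional "AWEs" with made-up values and labels.
toy_awes = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["casa", "casa", "perro", "perro"]
ap = same_different_ap(toy_awes, toy_labels)
```

In this toy case the two same-word pairs are also the two closest pairs, so the score is perfect. Shuffling the labels so that nearby vectors belong to different words lowers the score, which is the behaviour a discriminative embedding space is meant to avoid.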

Paper Structure (17 sections, 1 equation, 2 figures, 8 tables)

Introduction
Methodology
Data Preparation
Experimental Setup
Feature Extraction
SSL-based Speech Representations
MFCC Features
Model Details
Word Discrimination Task
Training Details
Results and Analysis
Cross-lingual Analysis
Analysis of Anagram Pairs
AWE Visualisation
Conclusions and Future Work
...and 2 more sections

Figures (2)

Figure 1: CAE-RNN training setup for extracting AWEs.
Figure 2: t-SNE visualisation of the AWEs derived from HuBERT-based CAE-RNN model for all five languages. From each language, all spoken instances of the top 7 words with the highest frequency count from the test set are chosen.
