Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach

Maxime Poli; Emmanuel Chemla; Emmanuel Dupoux

TL;DR

Fine-tuning speech representation models on phoneme classification yields more context-invariant representations, and language models trained on the resulting units achieve lexical comprehension comparable to that of models trained on a hundred times more data.

Abstract

Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up with their text-based counterparts in terms of semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and that language models trained on these units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
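
As a rough illustration of what "phoneme classification" can mean at training time, the sketch below computes a per-frame cross-entropy loss between classifier outputs and phoneme labels. The abstract does not state the exact objective, so the frame-level setup, the function names (`softmax`, `frame_cross_entropy`), and the toy numbers are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one frame's class scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_cross_entropy(frame_logits, frame_labels):
    # Mean negative log-likelihood of the labeled phoneme across frames.
    # Assumption: one phoneme label per frame (hypothetical setup).
    losses = []
    for logits, label in zip(frame_logits, frame_labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Toy example: 3 frames, 4 hypothetical phoneme classes.
logits = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
labels = [0, 1, 3]

loss = frame_cross_entropy(logits, labels)
# Baseline: a classifier that scores every class equally.
uniform_loss = frame_cross_entropy([[0.0] * 4] * 3, labels)
```

In this toy case the loss of the informed classifier is lower than that of the uniform baseline, which is the quantity a fine-tuning procedure of this kind would drive down.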

Paper Structure (17 sections, 4 figures, 5 tables)

Introduction and related work
Method
Phoneme classification
Quantization
Language modeling
Speech resynthesis
Evaluation metrics
Results
Results at the phonemic level
Results above the phonemic level
Conclusion
Limitations
Appendix
Fine-tuning results
Discrete units quality
...and 2 more sections

Figures (4)

Figure 1: Trade-off between language modeling and expressive resynthesis. *: embeddings initialized from unit centroids.
Figure 2: ABX error rate averaged across subset (dev-clean, dev-other) and speaker (within, across) conditions.
Figure 3: ABX error rate for models fine-tuned with CTC, averaged across subset and speaker conditions.
Figure 4: Difference between the MCD of the fine-tuned models and Base L11 on Expresso for each style.
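
The ABX error rate named in Figures 2 and 3 can be illustrated with a simplified sketch. Given triples (A, B, X) in which X belongs to the same category as A, a triple counts as an error when X lies closer to B than to A. Real ABX evaluations compare variable-length frame sequences; the version below assumes fixed-length vectors and an angular distance purely for brevity, and the function names and toy vectors are hypothetical.

```python
import math

def angular_distance(u, v):
    # Angle (in radians) between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)

def abx_error_rate(triples):
    # triples: list of (a, b, x) where x shares its category with a.
    # A triple is an error if x is closer to b; ties count as half an error.
    errors = 0.0
    for a, b, x in triples:
        d_ax = angular_distance(a, x)
        d_bx = angular_distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)

# Toy triples with 2-dimensional stand-ins for speech representations.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # x is closer to a: correct
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),  # x is closer to b: error
]
rate = abx_error_rate(triples)
```

A lower rate means the representations separate the two categories more reliably; the within-speaker and across-speaker conditions in the captions differ in which speakers produce A, B, and X.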
