Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words

Santiago Cuervo, Maciej Grabias, Jan Chorowski, Grzegorz Ciesielski, Adrian Łańcucki, Paweł Rychlikowski, Ricard Marxer

TL;DR

This paper proposes multi-level Aligned CPC (mACPC), a CPC variant that achieves the best performance on categorization tasks and incorporates multi-level modeling and optimization for the detection of spectral changes.

Abstract

We investigate how several self-supervised learning (SSL) methods based on Contrastive Predictive Coding (CPC) perform on phoneme categorization and on phoneme and word segmentation. Our experiments show that existing algorithms exhibit a trade-off between categorization and segmentation performance. We investigate the source of this conflict and conclude that context-building networks, although necessary for superior performance on categorization tasks, harm segmentation performance by causing a temporal shift in the learned representations. Aiming to bridge this gap, we take inspiration from the leading approach to segmentation, which simultaneously models the speech signal at the frame and phoneme levels, and incorporate multi-level modelling into Aligned CPC (ACPC), a variation of CPC which exhibits the best performance on categorization tasks. Our multi-level ACPC (mACPC) improves on all categorization metrics and achieves state-of-the-art performance in word segmentation.

Paper Structure

This paper contains 9 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The mACPC model has two main modules: a frame-level module and a segment-level module. The frame-level module works on raw waveforms and extracts latent representations. These are processed by the boundary detector, which predicts boundaries and averages the latents within those boundaries to produce segment representations. Finally, the segment-level module learns to predict higher-level features.
  • Figure 2: Segmentation performance for different predicted boundary shifts on TIMIT (left) and Buckeye (right). Models with context builders perform better with an offset, indicating a representation shift.
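
The two captions above describe operations that are easy to illustrate in isolation: averaging frame latents between predicted boundaries (Figure 1) and re-scoring predicted boundaries after shifting them by a fixed offset (Figure 2). The following is a minimal sketch in plain Python, not the authors' implementation. The function names, the convention that boundaries are segment start indices, the greedy one-to-one matching, and the tolerance value are all assumptions made for illustration.

```python
from typing import List, Sequence


def average_segments(latents: Sequence[Sequence[float]],
                     boundaries: Sequence[int]) -> List[List[float]]:
    """Mean-pool frame latents between consecutive boundaries.

    Assumption of this sketch: each boundary is the index of the first
    frame of a new segment, and frame 0 always opens the first segment.
    """
    n = len(latents)
    if n == 0:
        return []
    # Keep only boundaries that fall strictly inside the sequence.
    cuts = sorted({b for b in boundaries if 0 < b < n})
    edges = [0] + cuts + [n]
    segments = []
    for start, end in zip(edges[:-1], edges[1:]):
        frames = latents[start:end]
        dim = len(frames[0])
        segments.append(
            [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        )
    return segments


def shift_boundaries(boundaries: Sequence[int], offset: int,
                     num_frames: int) -> List[int]:
    """Shift boundaries by `offset` frames, dropping any that leave the signal."""
    return [b + offset for b in boundaries if 0 < b + offset < num_frames]


def boundary_f1(predicted: Sequence[int], reference: Sequence[int],
                tolerance: int = 2) -> float:
    """F1 score with greedy matching (a simplification).

    Each reference boundary can be matched by at most one prediction
    that lies within `tolerance` frames of it.
    """
    unused = list(reference)
    hits = 0
    for p in predicted:
        match = next((r for r in unused if abs(r - p) <= tolerance), None)
        if match is not None:
            unused.remove(match)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): five 2-dimensional frame latents, one boundary.
toy_latents = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
toy_segments = average_segments(toy_latents, [2])

# Toy predictions that lag the reference by three frames.
reference = [10, 20, 30]
predicted = [13, 23, 33]
unshifted_f1 = boundary_f1(predicted, reference, tolerance=1)
shifted_f1 = boundary_f1(shift_boundaries(predicted, -3, 40), reference, tolerance=1)
```

In this toy setup the lagging predictions only line up with the reference once the offset is applied, which mirrors the kind of offset sweep that Figure 2 reports for models with context builders.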