New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

Xugang Lu, Peng Shen, Hisashi Kawai

TL;DR

This work proposes an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities, and ensures that every linguistic token is grounded in at least one acoustic observation.

Abstract

Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we offer a new insight by treating alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model with experiments on a CTC-based ASR system that uses a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and thereby improving ASR performance.
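
The soft, partial matching described in the abstract can be illustrated with a minimal sketch of entropic unbalanced optimal transport. This is not the authors' implementation: it assumes the standard formulation with KL-relaxed marginals solved by generalized Sinkhorn scaling, a cosine-distance cost, uniform target marginals, and random stand-in features. The names `cosine_cost`, `unbalanced_sinkhorn`, `eps`, `lam1`, and `lam2` are hypothetical, and which weight governs which marginal is an assumption here (`lam1` for acoustic frames, `lam2` for linguistic tokens).

```python
import numpy as np

def cosine_cost(acoustic, linguistic):
    """Cosine distance (1 - cosine similarity) between frames and tokens."""
    a = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    l = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
    return 1.0 - a @ l.T

def unbalanced_sinkhorn(cost, a, b, eps=0.1, lam1=1.0, lam2=1.0, n_iter=200):
    """Entropic UOT with KL penalties on both marginals (assumed formulation).

    lam1 weights the acoustic (row) marginal, lam2 the linguistic (column)
    marginal. A small weight lets that side deviate from its target mass.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    p1 = lam1 / (lam1 + eps)         # exponent < 1 relaxes the row constraint
    p2 = lam2 / (lam2 + eps)         # exponent < 1 relaxes the column constraint
    for _ in range(n_iter):
        u = (a / (K @ v)) ** p1
        v = (b / (K.T @ u)) ** p2
    return u[:, None] * K * v[None, :]   # transport plan, shape (T, N)

# Toy data: random features stand in for real encoder outputs (assumption).
rng = np.random.default_rng(0)
T, N, D = 40, 6, 16                  # frames, tokens, feature dimension
acoustic = rng.normal(size=(T, D))
linguistic = rng.normal(size=(N, D))

cost = cosine_cost(acoustic, linguistic)
a = np.full(T, 1.0 / T)              # uniform mass over acoustic frames
b = np.full(N, 1.0 / N)              # uniform mass over linguistic tokens

# Loose acoustic marginal, tight linguistic marginal.
plan = unbalanced_sinkhorn(cost, a, b, eps=0.1, lam1=0.1, lam2=10.0)
token_mass = plan.sum(axis=0)        # mass received by each token
frame_mass = plan.sum(axis=1)        # mass actually sent by each frame
```

In this sketch, a large `lam2` keeps the mass received by each token close to its target, which is one way to read "every linguistic token is grounded", while a small `lam1` lets poorly matching frames (for example noise or silence) send less mass than their uniform share.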

Paper Structure

This paper contains 13 sections, 15 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The proposed cross-modal knowledge transfer learning framework for ASR.
  • Figure 2: Alignment and matching between acoustic and linguistic representations (the thickness of the arrow lines represents the degree of cross-modal association): (a) several consecutive acoustic frames are matched to one token (many-to-one matching); (b) acoustic transition frames are associated with two different tokens (one-to-many matching); (c) acoustic background frames or outliers have no corresponding linguistic tokens (NULL matching); (d) each linguistic token should have at least one best match in the acoustic space.
  • Figure 3: Optimal transport coupling with different weights controlling the marginal distributions in alignment and matching between acoustic and linguistic representations: (a) cosine similarity matrix between acoustic and linguistic representations; (b) uniform alignment and matching with Gaussian-shaped temporal coherence (the acoustic sequence is uniformly segmented and matched to the underlying tokens); (c) $\lambda_1=10.0$, $\lambda_2=10.0$; (d) $\lambda_1=0.1$, $\lambda_2=1.0$; (e) $\lambda_1=1.0$, $\lambda_2=1.0$; (f) $\lambda_1=0.01$, $\lambda_2=1.0$; (g) $\lambda_1=1.0$, $\lambda_2=0.01$; (h) $\lambda_1=0.05$, $\lambda_2=0.05$.
  • Figure 4: Gaussian-shaped uniform alignment vs. adaptive alignment based on unbalanced optimal transport (UOT): (a) uniform alignment with a Gaussian smoothing window size of $10$; (b) window size of $5$; (c) window size of $2$; (d) UOT with marginal control $\lambda_1=0.5$, $\lambda_2=0.5$.
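
The Gaussian-shaped uniform alignment mentioned in the Figure 3(b) and Figure 4 captions could look roughly like the sketch below. The captions do not define the window size, so this sketch treats it as the Gaussian standard deviation in frames, places each token at the center of an equal-length segment, and normalizes each frame's weights to sum to one. All of these choices, and the name `gaussian_uniform_alignment`, are assumptions for illustration only.

```python
import numpy as np

def gaussian_uniform_alignment(num_frames, num_tokens, sigma):
    """Soft alignment from uniform segmentation plus Gaussian smoothing.

    Token j is centered on the j-th of num_tokens equal-length segments;
    sigma (in frames) is an assumed reading of the "window size".
    """
    centers = (np.arange(num_tokens) + 0.5) * num_frames / num_tokens
    frames = np.arange(num_frames)[:, None]
    weights = np.exp(-0.5 * ((frames - centers[None, :]) / sigma) ** 2)
    # Normalize so each frame distributes a total weight of 1 over tokens.
    return weights / weights.sum(axis=1, keepdims=True)

# Wider windows spread each frame over more tokens; narrower ones
# approach a hard uniform segmentation.
wide = gaussian_uniform_alignment(num_frames=40, num_tokens=6, sigma=10.0)
narrow = gaussian_uniform_alignment(num_frames=40, num_tokens=6, sigma=2.0)
```

Unlike the UOT coupling, this baseline ignores the feature similarity entirely: the alignment depends only on frame position, sequence lengths, and the chosen width.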