A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

TL;DR

This work tackles precise seq2seq alignment in automatic speech recognition by replacing marginal-path losses (e.g., CTC) with a differentiable, one-dimensional optimal transport (OT) framework. It introduces the Sequence Optimal Transport Distance (SOTD), a pseudo-metric over finite sequences, and derives from it the Optimal Temporal Transport Classification (OTTC) loss, which jointly learns an alignment and token predictions in $O(\max(n,m))$ time and space. The method builds a differentiable, monotonic 1D OT mapping $\boldsymbol{\gamma}_n^{m,\boldsymbol{\beta}}$, parameterized by $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, so that training commits to a single alignment path instead of marginalizing over all valid paths, which reduces the peaky behavior seen with CTC. Experiments on TIMIT, AMI, and LibriSpeech show improved alignment metrics at some cost in word error rate (WER). The framework could extend to other modalities and tasks, and the code is public. Key contributions are the formal definition and properties of SOTD, the differentiable 1D OT alignment mechanism with $O(\max(n,m))$ scaling, and the OTTC loss, which improves temporal alignment and interpretability over existing E2E ASR losses.

Abstract

Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community. Our code is publicly available at: https://github.com/idiap/OTTC

Paper Structure

This paper contains 27 sections, 35 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Alignment between embeddings of frames and target sequence. Red bullets represent the elements of the target sequence $\{{\bm{y}}\}_m$, while the blue bullets indicate the frame embeddings $\{{\bm{x}}\}_n$. In OTTC, the alignment guides the prediction model $F$ in determining which frames should map to which labels. Additionally, the alignment model has the flexibility to leave some frames unaligned, as represented by the blue-and-white bullets, allowing those frames to be dropped during inference.
  • Figure 2: Discrete monotonic alignment as 1D OT solution. A discrete monotonic alignment represents a temporal alignment between two sequences (target on top, frame embeddings on bottom). It can be modeled by $\boldsymbol{\gamma}_n^{m,\boldsymbol{\beta}}$, as illustrated in the graph. The thickness of the links reflects the amount of mass $\boldsymbol{\gamma}_n^{m,\boldsymbol{\beta}}(\boldsymbol{\alpha})_{i,j}$ transported, with thicker links corresponding to higher mass.
  • Figure 3: A CTC alignment. Here, we illustrate one of the valid alignments for CTC. The CTC loss maximizes the marginal probability over all such possible alignments.
  • Figure 4: CTC and OTTC alignments. Phoneme-level transcription of CTC and OTTC, compared to a reference from TIMIT.
  • Figure 5: 1D OT transport computation. Illustration of the optimal transport process, computed iteratively by transferring probability mass from the smallest bins to the largest.
  • ...and 2 more figures
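
The iterative mass transfer described in the Figure 5 caption can be illustrated with a minimal sketch of the classical two-pointer construction for 1D optimal transport between two histograms on ordered bins. This is not the authors' OTTC implementation: the function name `monotone_transport_1d`, the list-based inputs, and the tolerance `tol` are assumptions made for illustration, and the sketch reads "smallest bins to the largest" as processing bins in increasing index order. It returns a sparse, monotonic transport plan in $O(n+m)$ steps, which is of the same order as the $O(\max(n,m))$ cost quoted above.

```python
def monotone_transport_1d(alpha, beta, tol=1e-12):
    """Sparse 1D transport plan between two histograms on ordered bins.

    alpha: masses of the n source bins (e.g. frames), in temporal order.
    beta:  masses of the m target bins (e.g. labels), in temporal order.
    Returns a list of (i, j, mass) triples with non-decreasing i and j.
    Hypothetical helper for illustration only.
    """
    if not alpha or not beta:
        raise ValueError("both histograms must be non-empty")
    if abs(sum(alpha) - sum(beta)) > 1e-9:
        raise ValueError("both histograms must carry the same total mass")

    plan = []
    i, j = 0, 0
    remaining_a, remaining_b = alpha[0], beta[0]

    # Each iteration empties at least one bin, so the loop runs at most
    # n + m times.
    while i < len(alpha) and j < len(beta):
        moved = min(remaining_a, remaining_b)
        if moved > tol:
            plan.append((i, j, moved))
        remaining_a -= moved
        remaining_b -= moved

        # Advance past whichever bin(s) have been exhausted.
        if remaining_a <= tol:
            i += 1
            remaining_a = alpha[i] if i < len(alpha) else 0.0
        if remaining_b <= tol:
            j += 1
            remaining_b = beta[j] if j < len(beta) else 0.0
    return plan


# Four frames with uniform mass aligned to two labels with uniform mass.
example_plan = monotone_transport_1d([0.25, 0.25, 0.25, 0.25], [0.5, 0.5])
print(example_plan)
# [(0, 0, 0.25), (1, 0, 0.25), (2, 1, 0.25), (3, 1, 0.25)]
```

In this toy case the first two frames send their mass to the first label and the last two to the second, so the plan is monotonic in both indices. Changing the source masses shifts where the label boundaries fall, which is the sense in which a learnable $\boldsymbol{\alpha}$ can steer the alignment.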