Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

Eliya Segev, Maya Alroy, Ronen Katsir, Noam Wies, Ayana Shenhav, Yael Ben-Oren, David Zar, Oren Tadmor, Jacob Bitterman, Amnon Shashua, Tal Rosenwein

TL;DR

The paper addresses a limitation of CTC: it distinguishes only between perfect and imperfect alignments. It introduces Align With Purpose (AWP), a general Plug-and-Play framework that adds a property-specific loss L_AWP to the standard CTC objective L_CTC, steering alignments toward a desired attribute via a property function f_prop. AWP samples alignments from the model, transforms them with f_prop, and applies a hinge loss to each resulting pair, giving the combined loss L(x) = L_CTC(x) + α L_AWP(x). It can target properties such as emission time (low latency) and minimum WER (mWER). Experiments on ASR across multiple architectures and training-set scales (up to 280K hours) show a latency reduction of up to 570 ms and a relative WER improvement of roughly 4–4.5%, with only minimal code changes. The framework can be extended to other alignment-free objectives and domains, offering a simple way to prioritize diverse alignment properties during training while preserving transcription quality.
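The sample, transform, and hinge steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `margin` parameter, the direction of the hinge (penalizing a sampled alignment that is not less probable than its transformed counterpart by at least `margin`), and the use of per-frame independent sampling are all assumptions made for the example. The CTC loss itself is taken as a given number.

```python
import math
import random


def alignment_log_prob(log_probs, alignment):
    """Log-probability of one alignment: sum of per-frame token log-probs."""
    return sum(frame[token] for frame, token in zip(log_probs, alignment))


def sample_alignment(log_probs, rng):
    """Draw one token index per frame from the per-frame distributions."""
    alignment = []
    for frame in log_probs:
        weights = [math.exp(lp) for lp in frame]
        alignment.append(rng.choices(range(len(frame)), weights=weights, k=1)[0])
    return alignment


def awp_loss(log_probs, f_prop, n_samples, margin, rng):
    """Average hinge loss over n_samples (sampled, transformed) alignment pairs.

    Assumed form: max(0, P(a|x) - P(f_prop(a)|x) + margin).
    """
    total = 0.0
    for _ in range(n_samples):
        a = sample_alignment(log_probs, rng)
        a_bar = f_prop(a)  # alignment that better satisfies the desired property
        p_a = math.exp(alignment_log_prob(log_probs, a))
        p_a_bar = math.exp(alignment_log_prob(log_probs, a_bar))
        total += max(0.0, p_a - p_a_bar + margin)
    return total / n_samples


def combined_loss(ctc_loss, awp, alpha):
    """L(x) = L_CTC(x) + alpha * L_AWP(x)."""
    return ctc_loss + alpha * awp
```

In this sketch `log_probs` is a list of frames, each a list of log-probabilities over the vocabulary, and `f_prop` is any function mapping one alignment (a list of token indices) to another of the same length. Because the CTC term enters only as an added number, the property loss needs no change to the CTC computation, which is the plug-and-play aspect the summary describes.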

Abstract

Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (that yield the ground truth), at the expense of imperfect alignments. This binary differentiation of perfect and imperfect alignments falls short of capturing other essential alignment properties that hold significance in other real-world applications. Here we propose Align With Purpose, a general Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion. We do that by complementing the CTC with an additional loss term that prioritizes alignments according to a desired property. Our method does not require any intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation between both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of training dataset (up to 280,000 hours). To demonstrate the effectiveness of our framework, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report an improvement of up to 570 ms in latency optimization with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated to work on a scale of data as large as ours. Notably, our method can be implemented using only a few lines of code, and can be extended to other alignment-free loss functions and to domains other than ASR.

Paper Structure (22 sections, 7 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: The Align With Purpose flow: $N$ alignments are sampled from the output of a pre-trained CTC model on which $f_{prop}$ is applied to create $N$ pairs of alignments. Then, hinge loss with an adjustable weight is applied on the probabilities of each pair of alignments, trained jointly with a CTC loss. See the Align With Purpose (AWP) section for full details.
  • Figure 2: A visualization of two properties that are not captured by CTC. (a) Emission Time: Two alignments that yield the same text, but the green alignment emits the last token of 'CAT' at timestamp 3 ($t_3$) while the purple alignment emits it at $t_6$. (b) Word-Error-Rate: Two imperfect predictions with the same CER but different WER.
  • Figure 3: Drift in emission time in a CTC model. Bottom purple text: An offline Stacked ResNet model with symmetric padding, with 6.4 seconds of context divided equally between past and future contexts. Top green text: An online Stacked ResNet with asymmetric padding, with 430 ms future context and 5.97 seconds past context. It can be seen that the output of the online model has a drift of $\ge 200$ ms.
  • Figure 4: Defining $f_{low\_latency}$. To obtain $\bar{\bm{a}}$, we shift the sampled alignment $\bm{a}$ one token to the left, starting from a random position (second token in this example) within the alignment, and pad $\bar{\bm{a}}$ with a trailing blank token, marked by a black rectangle.
  • Figure 5: Defining $f_{mWER}$. Given a target transcription 'the cat', the (upper) sampled alignment yields the text 'tha cet', which has 100% WER. Substituting the occurrences of the token 'e' with the token 'a' produces the text 'tha cat', which has 50% WER.
  • ...and 2 more figures
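
The property functions in the captions of Figures 4 and 5 can be illustrated with a short sketch. It is one reading of the captions rather than the paper's code: the blank symbol `_`, the helper names, and the choice that the token at the start position is overwritten by its right neighbour during the shift are assumptions. The `wer` helper is a standard word-level edit-distance computation and assumes a non-empty reference.

```python
BLANK = "_"  # hypothetical blank symbol, chosen for readability


def f_low_latency(alignment, start, blank=BLANK):
    """Shift tokens from index `start` one frame left, pad with a trailing blank.

    Assumption: the token at `start` is overwritten by its right neighbour.
    """
    if not 0 <= start < len(alignment):
        raise ValueError("start must index into the alignment")
    return alignment[:start] + alignment[start + 1:] + [blank]


def collapse(alignment, blank=BLANK):
    """Standard CTC collapse: merge repeated tokens, then drop blanks."""
    out, prev = [], None
    for tok in alignment:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return "".join(out)


def last_emission_frame(alignment, blank=BLANK):
    """Index of the last non-blank frame, or -1 if there is none."""
    for i in range(len(alignment) - 1, -1, -1):
        if alignment[i] != blank:
            return i
    return -1


def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    a = list("__CAT___")
    a_bar = f_low_latency(a, start=1)
    # Same text, but the last token is emitted one frame earlier.
    print(collapse(a), last_emission_frame(a))
    print(collapse(a_bar), last_emission_frame(a_bar))

    # Figure 5: replacing 'e' with 'a' in 'tha cet' lowers WER against 'the cat'.
    sampled = "tha cet"
    improved = sampled.replace("e", "a")
    print(wer("the cat", sampled), wer("the cat", improved))
```

In the first example the shifted alignment still collapses to the same text while its last non-blank frame moves one step earlier, which is the property $f_{low\_latency}$ is meant to favour. Shifting from a position that overwrites a non-repeated token would change the text; the captions do not say how that case is handled, so the sketch leaves it to the caller.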