TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR

Nagarathna Ravi, Thishyan Raj T, Vipul Arora

TL;DR

TeLeS introduces the Temporal-Lexeme Similarity score, a continuous target for training confidence estimators in end-to-end ASR. It addresses two shortcomings of existing approaches: binary target labels and overconfident class-probability scores. The method trains a Word-Level Confidence model (TeLeS-WLC) on intermediate ASR states, using a shrinkage loss to counter the imbalance in target scores, and extends to TeLeS-A, an acquisition function for active learning that selects informative pseudo-labeled samples. Across Hindi, Tamil, and Kannada, TeLeS-WLC achieves better calibration and lower WER than state-of-the-art (SOTA) baselines, and it generalizes to mismatched domains, as shown on the KB datasets. The work also provides an open-source Hindi dataset and demonstrates the practical benefits of human-in-the-loop (HITL) data acquisition for robust domain adaptation.

Abstract

Confidence estimation of predictions from an End-to-End (E2E) Automatic Speech Recognition (ASR) model benefits ASR's downstream and upstream tasks. Class-probability-based confidence scores do not accurately represent the quality of overconfident ASR predictions. An ancillary Confidence Estimation Model (CEM) calibrates the predictions. State-of-the-art (SOTA) solutions use binary target scores for CEM training. However, the binary labels do not reveal the granular information of predicted words, such as temporal alignment between reference and hypothesis and whether the predicted word is entirely incorrect or contains spelling errors. Addressing this issue, we propose a novel Temporal-Lexeme Similarity (TeLeS) confidence score to train CEM. To address the data imbalance of target scores while training CEM, we use shrinkage loss to focus on hard-to-learn data points and minimise the impact of easily learned data points. We conduct experiments with ASR models trained in three languages, namely Hindi, Tamil, and Kannada, with varying training data sizes. Experiments show that TeLeS generalises well across domains. To demonstrate the applicability of the proposed method, we formulate a TeLeS-based Acquisition (TeLeS-A) function for sampling uncertainty in active learning. We observe a significant reduction in the Word Error Rate (WER) as compared to SOTA methods.
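
The abstract's remark that shrinkage loss focuses on hard-to-learn data points can be illustrated with a small sketch. It assumes the commonly used shrinkage form, in which the squared error l^2 (with l = |prediction - target|) is divided by 1 + exp(a * (c - l)). The function names and the values a = 10 and c = 0.2 are illustrative assumptions, not settings taken from this paper.

```python
import math


def shrinkage_loss(pred: float, target: float, a: float = 10.0, c: float = 0.2) -> float:
    """Squared error damped by a sigmoid-shaped factor of the absolute error.

    `a` (shrinkage speed) and `c` (error level around which damping fades)
    are assumed, illustrative hyperparameters.
    """
    l = abs(pred - target)
    # Small errors (l well below c) get a large denominator and are shrunk;
    # large errors (l well above c) get a denominator close to 1.
    return (l * l) / (1.0 + math.exp(a * (c - l)))


def mean_shrinkage_loss(preds, targets, a: float = 10.0, c: float = 0.2) -> float:
    """Average shrinkage loss over paired confidence predictions and targets."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal in length")
    total = sum(shrinkage_loss(p, t, a, c) for p, t in zip(preds, targets))
    return total / len(preds)


if __name__ == "__main__":
    # An easy word (error 0.05) versus a hard word (error 0.80).
    for pred, target in [(0.95, 1.0), (0.10, 0.90)]:
        squared = (pred - target) ** 2
        shrunk = shrinkage_loss(pred, target)
        print(f"error={abs(pred - target):.2f} squared={squared:.4f} shrinkage={shrunk:.4f}")
```

With these assumed settings, a word whose predicted confidence is off by 0.05 keeps roughly 18% of its squared error, while a word that is off by 0.80 keeps over 99%. The many easily learned words therefore contribute little to the gradient, which is the balancing effect the abstract describes.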

Paper Structure (23 sections, 15 equations, 5 figures, 12 tables)

Figures (5)

  • Figure 1: Class-prob. based Scores of Correct and Wrong Words
  • Figure 2: TeLeS-WLC Train Architecture
  • Figure 3: Distribution of Words across TeLeS Score Range
  • Figure 4: Calibration Curve - Mask-MAE-L
  • Figure 5: Calibration Curve - Mask-Shrink-L