Target word activity detector: An approach to obtain ASR word boundaries without lexicon

Sunit Sivasankaran, Eric Sun, Jinyu Li, Yan Huang, Jing Pan

TL;DR

This paper tackles the problem of obtaining word boundaries from end-to-end ASR models without relying on lexicons, which are a bottleneck for multilingual models. It introduces the Target Word Activity Detector (TWAD), which learns word embeddings from subword tokens and uses a pretrained ASR model to estimate a word activity matrix $\hat{\mathbf{A}} \in \mathbb{R}^{N\times W}$, from which boundaries are derived via discrete time warping. Evaluated on a multilingual ASR model trained on five languages, TWAD achieves word-timing errors comparable to a strong lexicon-based baseline while avoiding lexicon dependencies and scaling limitations. The approach reduces the linguistic resources needed for word timing in multilingual settings, which benefits downstream tasks such as diarization and editing. Future work includes incorporating punctuation and subword tokenization details to further improve boundary accuracy.

Abstract

Obtaining word timestamp information from end-to-end (E2E) ASR models remains challenging due to the lack of explicit time alignment during training. This issue is further complicated in multilingual models. Existing methods either rely on lexicons or introduce additional tokens, leading to scalability issues and increased computational costs. In this work, we propose a new approach to estimate word boundaries without relying on lexicons. Our method leverages word embeddings from sub-word token units and a pretrained ASR model, requiring only word alignment information during training. Our proposed method can scale up to any number of languages without incurring any additional cost. We validate our approach using a multilingual ASR model trained on five languages and demonstrate its effectiveness against a strong baseline.
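To illustrate how word boundaries might be read off a word activity matrix, the sketch below runs a DTW-style monotonic alignment over an $N \times W$ matrix of per-frame word activities. This is an assumption-laden illustration, not the paper's procedure: the function name `decode_word_boundaries`, the log-probability scoring, and the simplification that every frame belongs to exactly one word (no silence) are all hypothetical choices made for the example.

```python
import numpy as np


def decode_word_boundaries(activity, eps=1e-8):
    """Return one inclusive (start_frame, end_frame) pair per word.

    `activity` is an (N, W) array whose entry [n, w] scores how active
    word w is at frame n. Assumption for this sketch: words occur in
    order and every frame is assigned to exactly one word.
    """
    activity = np.asarray(activity, dtype=float)
    n_frames, n_words = activity.shape
    if n_words > n_frames:
        raise ValueError("need at least one frame per word")

    log_p = np.log(activity + eps)
    score = np.full((n_frames, n_words), -np.inf)
    advanced = np.zeros((n_frames, n_words), dtype=bool)
    score[0, 0] = log_p[0, 0]  # the path must start in the first word

    for n in range(1, n_frames):
        for w in range(n_words):
            stay = score[n - 1, w]
            move = score[n - 1, w - 1] if w > 0 else -np.inf
            advanced[n, w] = move > stay  # True: word w starts at frame n
            score[n, w] = max(stay, move) + log_p[n, w]

    # Backtrack from the last frame of the last word.
    boundaries = [None] * n_words
    w, end = n_words - 1, n_frames - 1
    for n in range(n_frames - 1, 0, -1):
        if advanced[n, w]:
            boundaries[w] = (n, end)
            end = n - 1
            w -= 1
    boundaries[0] = (0, end)
    return boundaries


# Toy input: 6 frames, 3 words.
toy_activity = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.90, 0.00],
    [0.20, 0.70, 0.10],
    [0.00, 0.10, 0.90],
])
print(decode_word_boundaries(toy_activity))  # [(0, 1), (2, 4), (5, 5)]
```

Frame indices would still need to be converted to seconds using the encoder's frame shift, which this section does not specify.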

Paper Structure (10 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: TWAD model architecture. Only parameters inside the TWAD model are updated. $W$ is the number of words in a sentence. $S$ is the max number of tokens across all the words in a sentence.
  • Figure 2: Output of the TWAD model for the sentence: 'Je vais te poser une question, je veux que tu me dises la vérité, tu es d'accord?' which after text normalization becomes 'je vais te poser une question je veux que tu me dises la vérité tu es d'accord'
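
To make the $W$ and $S$ dimensions from the Figure 1 caption concrete, the following sketch groups a sentence's sub-word tokens by word and pads each word to the longest token count, giving a $W \times S$ array of token ids plus a padding mask. It is only an illustration under stated assumptions: the helper `group_tokens_by_word`, the `pad_id` value, and the SentencePiece-style "▁" word-start marker are hypothetical, since this section does not describe the tokenizer or the input layout that TWAD uses.

```python
import numpy as np

# Assumed tokenizer convention: a token starting with this marker begins a new word.
WORD_START = "\u2581"


def group_tokens_by_word(tokens, token_ids, pad_id=0):
    """Group sub-word token ids by word and pad to shape (W, S).

    Returns (ids, mask): `ids` holds token ids padded with `pad_id`,
    `mask` is True where a real token is present.
    """
    words = []
    for tok, tid in zip(tokens, token_ids):
        if tok.startswith(WORD_START) or not words:
            words.append([])  # open a new word
        words[-1].append(tid)

    if not words:
        return np.zeros((0, 0), dtype=np.int64), np.zeros((0, 0), dtype=bool)

    n_words = len(words)                       # W
    max_tokens = max(len(w) for w in words)    # S
    ids = np.full((n_words, max_tokens), pad_id, dtype=np.int64)
    mask = np.zeros((n_words, max_tokens), dtype=bool)
    for i, word in enumerate(words):
        ids[i, : len(word)] = word
        mask[i, : len(word)] = True
    return ids, mask


# Toy example with made-up token ids for "je vais te".
example_tokens = ["\u2581je", "\u2581va", "is", "\u2581te"]
example_ids = [5, 7, 8, 9]
ids, mask = group_tokens_by_word(example_tokens, example_ids)
print(ids.shape)  # (3, 2): W = 3 words, S = 2 tokens at most per word
```

A per-word embedding could then be obtained by pooling token embeddings over the masked positions, though how TWAD combines them is not stated here.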