Word Level Timestamp Generation for Automatic Speech Recognition and Translation

Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, Boris Ginsburg

TL;DR

This paper addresses the need for word-level timestamps in ASR and AST with a data-driven approach: the Canary model is augmented with a <|timestamp|> prompt, and the NeMo Forced Aligner (NFA) serves as a teacher that generates per-word start and end times. Timestamps are encoded as fixed <|t|> tokens and learned jointly with transcription under teacher guidance, so the model predicts timestamps without a separate alignment module. For ASR, the method reaches precision and recall between 80% and 90% with timestamp errors of 20 to 120 ms across four languages. For AST, average timestamp errors are around 200 ms, with a measurable translation-quality trade-off. Compared to prior work such as WhisperTimestamped, the approach delivers stronger ASR timestamping performance and demonstrates that multilingual word-level timestamp generation for AST is feasible using teacher supervision and cross-language alignment.

Abstract

We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors around 200 milliseconds.

Paper Structure

This paper contains 13 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Timestamp training data format. The training data consists of both ASR and AST data. Either the <|timestamp|> or the <|notimestamp|> prompt token is used for prompting. Both start and end timestamps are added for each word during training.
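
The data format described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The helper names (`to_token`, `build_target`), the 80 ms frame duration, the `<|N|>` frame-index spelling of timestamp tokens, and the toy word timings are all assumptions made for illustration. Only the <|timestamp|> and <|notimestamp|> prompt tokens and the start/end-per-word layout come from the source. A real Canary prompt likely carries additional tokens that are omitted here.

```python
from typing import List, Tuple

# Assumption: timestamps are quantized to 80 ms frames. The actual
# resolution used by the authors is not stated in this summary.
FRAME_SEC = 0.08


def to_token(t_sec: float, frame_sec: float = FRAME_SEC) -> str:
    """Quantize a time in seconds to a hypothetical <|N|> frame-index token."""
    return f"<|{round(t_sec / frame_sec)}|>"


def build_target(
    words: List[Tuple[str, float, float]],
    with_timestamps: bool = True,
) -> str:
    """Build a training target from (word, start_sec, end_sec) triples.

    With timestamps enabled, each word is wrapped by its start and end
    tokens; otherwise only the words follow the <|notimestamp|> prompt.
    """
    prompt = "<|timestamp|>" if with_timestamps else "<|notimestamp|>"
    parts = [prompt]
    for word, start, end in words:
        if with_timestamps:
            parts += [to_token(start), word, to_token(end)]
        else:
            parts.append(word)
    return " ".join(parts)


# Toy alignment, standing in for the output of a forced aligner (made-up values).
alignment = [("hello", 0.0, 0.4), ("world", 0.48, 0.96)]

print(build_target(alignment))
print(build_target(alignment, with_timestamps=False))
```

In this sketch the teacher alignment is supplied by hand; in the setup the abstract describes, those triples would come from the NeMo Forced Aligner.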