Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling

Jakob Poncelet, Hugo Van hamme

TL;DR

This work tackles weakly supervised ASR by combining large-scale TV subtitles with verbatim transcripts to train multitask models that produce both a verbatim transcription and a subtitle. It introduces cascaded encoder designs with dual decoders and cross-attention mechanisms, including a shared-task decoder option, and shows that modeling subtitles and verbatim transcripts in separate branches yields substantial improvements in WER and subtitle quality for Flemish Dutch. In scaling experiments with up to 14k hours of subtitled data, the proposed cascaded dual-encoder model achieves state-of-the-art results relative to the baselines and a strong Whisper model, while using far fewer parameters. The approach enables robust, scalable automatic subtitling and high-quality verbatim transcription suitable for downstream NLP applications, with broader implications for cross-domain speech processing and multilingual extensions.

Abstract

The recent advancement of speech recognition technology has been driven by large-scale datasets and attention-based architectures, but many challenges remain, especially for low-resource languages and dialects. This paper explores the integration of weakly supervised transcripts from TV subtitles into automatic speech recognition (ASR) systems, aiming to improve both verbatim transcriptions and automatically generated subtitles. To this end, verbatim data and subtitles are regarded as different domains or languages, due to their distinct characteristics. We propose and compare several end-to-end architectures that are designed to jointly model both modalities with separate or shared encoders and decoders. The proposed methods are able to jointly generate a verbatim transcription and a subtitle. Evaluation on Flemish (Belgian Dutch) demonstrates that a model with cascaded encoders and separate decoders represents the differences between the two data types most efficiently while improving on both domains. Despite differences in domain and linguistic variations, combining verbatim transcripts with subtitle data leads to notable ASR improvements without the need for extensive preprocessing. Additionally, experiments with a large-scale subtitle dataset show the scalability of the proposed approach. The methods not only improve ASR accuracy but also generate subtitles that closely match standard written text, offering several potential applications.

Paper Structure

This paper contains 44 sections, 3 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Overview of the proposed approach. Verbatim transcriptions from ASR datasets and subtitle transcripts from a large source of broadcast media data are gathered. A dual speech-to-text model is trained to output a verbatim transcription for the input speech and to generate a well-suited subtitle at the same time, by jointly learning from both data streams.
  • Figure 2: Comparison of all proposed models.
  • Figure 3: Comparison between subtitle decoder blocks with only one cross-attention to the output of the subtitle encoder ("Transformer"), or with two cross-attentions, i.e. once to the output of the ASR encoder and once to the output of the subtitle encoder ("Multi-Transformer"). The subtitle encoder is either conditioned on the ASR encoder ("Enc.") or decoder ("Dec.") features.
  • Figure 4: Scaling experiments for the cascaded model with dual encoder features. On the horizontal axis, increasing amounts of subtitled training data are used. Panel (a) shows WERs ($\downarrow$) of the verbatim ASR decoder outputs with respect to the reference verbatim transcription. Panel (b) shows BLEU scores ($\uparrow$) of the subtitle decoder outputs with respect to the reference subtitles. All pairwise comparisons between the results are statistically significant.
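
The WER reported in Figure 4 compares the verbatim decoder output against a reference transcription. As a minimal sketch of how such a score is conventionally computed (word-level edit distance divided by the number of reference words), the snippet below may help. The function name `word_error_rate` and the example sentences are illustrative assumptions; the scoring tool and text normalization actually used in the paper are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by whitespace splitting, with no
    casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the hypothesis drops one of seven reference words.
example_wer = word_error_rate(
    "ik ga eh morgen naar de markt",
    "ik ga morgen naar de markt",
)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words. Subtitles often omit or rephrase spoken words, so scoring subtitle output with WER against a verbatim reference would penalize those edits, which is consistent with the figure reporting BLEU for the subtitle decoder instead.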