Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs

Artem Fedorchenko, Tanel Alumäe

TL;DR

The paper addresses Estonian TV subtitle generation by fine-tuning Whisper on human subtitles, then improving it with iterative pseudo-labeling on a larger unlabeled corpus and with LLM-based post-editing. Iterative semi-supervised learning improves subtitle quality on both the SubER and BLEURT metrics. Post-editing with GPT-4o helps when applied at inference time, but applying it to pseudo-labels during training yields no further gains. The work suggests a practical path toward near-human subtitle quality and possible real-time use, with future work focusing on real-time adaptation and robustness.

Abstract

This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitle quality close to human standard and could be extended to real-time applications.
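The iterative pseudo-labeling loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `iterative_pseudo_labeling`, `train`, `post_edit`, and `toy_train` are hypothetical names, the number of rounds is arbitrary, and retraining on the union of labeled and pseudo-labeled pairs in each round is an assumption about the setup. The optional `post_edit` hook stands in for an LLM editing step applied to pseudo-labels.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]           # (audio_id, subtitle_text)
Model = Callable[[str], str]     # maps an audio_id to subtitle text


def iterative_pseudo_labeling(
    labeled: List[Pair],
    unlabeled_audio: List[str],
    train: Callable[[List[Pair]], Model],
    rounds: int = 2,
    post_edit: Optional[Callable[[str], str]] = None,
) -> Model:
    """Train on labeled data, then repeatedly relabel the unlabeled pool and retrain."""
    model = train(labeled)  # supervised starting point
    for _ in range(rounds):
        pseudo: List[Pair] = []
        for audio in unlabeled_audio:
            text = model(audio)            # pseudo-label from the current model
            if post_edit is not None:
                text = post_edit(text)     # optional (hypothetical) LLM edit
            pseudo.append((audio, text))
        # Assumption: each round retrains on labeled plus fresh pseudo-labels.
        model = train(labeled + pseudo)
    return model


def toy_train(pairs: List[Pair]) -> Model:
    """Toy stand-in for fine-tuning: memorizes the training pairs."""
    table = dict(pairs)
    return lambda audio: table.get(audio, "<unk>")
```

In a real setting, `train` would wrap a fine-tuning run of a speech model and `post_edit` would wrap a call to an LLM. Passing `post_edit=None` corresponds to using the pseudo-labels as is.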

Paper Structure (11 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Pseudo-labels generated by the model are either passed through an LLM or used as is.
  • Figure 2: Example of an LLM instruction used for refining Estonian subtitles. The model corrected the spelling of the TV show name "Kahekõne" and the historical place name "Toompea" in Estonia.