SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
Marco Gaido, Sara Papi, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
TL;DR
This work tackles automatic subtitling without intermediate transcripts by proposing a direct end-to-end model that generates subtitles and also predicts their timestamps. It introduces two approaches to timestamp estimation, a Subtitle CTC-based method and an attention-based method that includes DTW and the SBAAM algorithm, together with SubSONAR, a timing-sensitive evaluation metric. Across seven language pairs and diverse domains, the approach achieves new state-of-the-art results and narrows or closes the gap with cascade systems, and manual evaluation confirms substantial reductions in timestamp edits and shifts. The findings demonstrate the practical viability of transcription-free subtitling and highlight SubSONAR as a focused tool for assessing temporal alignment and timing quality.
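To make the attention-based idea concrete, the following is a minimal sketch of how generic dynamic time warping (DTW) can turn a token-to-audio attention matrix into timestamps. It is not the SBAAM algorithm, and it is not necessarily the DTW configuration used in this work. The function names, the cost definition (one minus the attention weight), the `boundary_tokens` indices, and the `frame_sec` frame duration are all assumptions made for illustration.

```python
from typing import List, Tuple


def dtw_path(attn: List[List[float]]) -> List[Tuple[int, int]]:
    """Return a monotonic (token, frame) path through an attention matrix.

    attn[i][j] is the (assumed) attention weight of output token i on audio
    frame j. Classic DTW is run with cost = 1 - attention, so the path
    prefers cells with high attention.
    """
    n_tok, n_frm = len(attn), len(attn[0])
    inf = float("inf")
    # acc[i][j]: minimum accumulated cost of any path ending at (i, j)
    acc = [[inf] * n_frm for _ in range(n_tok)]
    for i in range(n_tok):
        for j in range(n_frm):
            cost = 1.0 - attn[i][j]
            if i == 0 and j == 0:
                acc[i][j] = cost
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,                  # next token, same frame
                acc[i][j - 1] if j > 0 else inf,                  # same token, next frame
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,    # both advance
            )
            acc[i][j] = cost + best_prev

    # Backtrack from the last cell to the first one.
    i, j = n_tok - 1, n_frm - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


def boundary_end_times(
    path: List[Tuple[int, int]],
    boundary_tokens: List[int],
    frame_sec: float,
) -> List[float]:
    """Map each boundary token to the end time of the last frame aligned to it.

    boundary_tokens (hypothetical) holds the indices of the tokens that close
    a subtitle; frame_sec (hypothetical) is the duration of one audio frame.
    """
    last_frame = {}
    for tok, frm in path:
        last_frame[tok] = frm  # the path is monotonic, so the last write wins
    return [(last_frame[t] + 1) * frame_sec for t in boundary_tokens]
```

Given a small attention matrix, `dtw_path` returns the aligned (token, frame) pairs, and `boundary_end_times` converts the frames aligned to the chosen boundary tokens into seconds. Forcing the path to be monotonic reflects the assumption that subtitles follow the audio in order.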
Abstract
Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, which are used in different ways for the three subtasks. In response to the acknowledged limitations of this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles, entirely eliminating any dependence on intermediate transcripts, including for timestamp prediction. Experimental results, backed by manual evaluation, showcase our solution's new state-of-the-art performance across multiple language pairs and diverse conditions.
