Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

Bruno Korbar; Jaesung Huh; Andrew Zisserman

TL;DR

This work defines and tackles character-aware subtitling, an automatic SDH-like task that identifies who spoke what and when, using an audio-visual pipeline that does not rely on face detection or tracking. It introduces a two-stage approach: first building high-precision per-character speech exemplars from audio-visual cues, then assigning all speech segments to character identities via centroid-based embedding similarity, with an explicit unknown option controlled by a distance threshold. The authors contribute a dataset built from three sitcoms with ground-truth speaker identities and timestamps, and demonstrate competitive performance with a detailed analysis of Stage 1 and Stage 2, including transcription quality via WhisperX. The work advances practical subtitle generation for accessibility and supports large-scale data creation for video-language research, while acknowledging limitations with short utterances and overlapping speech. Overall, the method offers a scalable, face-detection-free solution for automatically generating character-aware subtitles with precise timing.
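The second stage described above can be illustrated with a minimal sketch. It assumes cosine distance between L2-normalised voice embeddings and one mean-embedding centroid per character. The summary only says "centroid-based embedding similarity" with a distance threshold, so the exact metric, the function names, and the toy 2-D embeddings below are all hypothetical rather than the authors' implementation.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Mean of unit-normalised exemplar embeddings, re-normalised."""
    unit = [l2_normalise(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return l2_normalise(mean)

def cosine_distance(a, b):
    """1 - cosine similarity; assumed metric, not confirmed by the source."""
    a, b = l2_normalise(a), l2_normalise(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

def assign_speaker(segment_embedding, exemplars, d):
    """Return the nearest character, or "unknown" if farther than threshold d.

    exemplars: dict mapping character name -> list of exemplar embeddings
               (the high-precision set that Stage 1 would produce).
    """
    distances = {
        name: cosine_distance(segment_embedding, centroid(embs))
        for name, embs in exemplars.items()
    }
    best = min(distances, key=distances.get)
    return best if distances[best] <= d else "unknown"

# Toy exemplars with hypothetical character labels.
exemplars = {
    "character_a": [[1.0, 0.1], [0.9, 0.0]],
    "character_b": [[0.0, 1.0], [0.1, 0.9]],
}
near_a = assign_speaker([0.95, 0.05], exemplars, d=0.2)
near_b = assign_speaker([0.05, 1.0], exemplars, d=0.2)
far = assign_speaker([-1.0, -1.0], exemplars, d=0.2)
```

Raising `d` assigns more segments to named characters at the risk of more errors, while lowering it sends more segments to "unknown"; this is the trade-off that a threshold sweep would expose.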

Abstract

The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast number of videos available on modern streaming services. Project page: https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/

Paper Structure (13 sections, 4 figures, 5 tables)

Introduction
Related work
Method
Stage 1: building audio exemplars
Stage 2: Assigning characters to speech segments
Implementation details
Evaluation Dataset
Annotation procedure
Dataset statistics
Results
Detailed analysis of Stage 1 and 2
Overall performance on the test set
Conclusions

Figures (4)

Figure 1: Character-aware audio-visual subtitling. The generated data covers what is said, when it is said, and by whom it is said.
Figure 2: Overview of our method. We first build a database of audio exemplars for each character by filtering speech segments until only a high precision set remains (left). Each speech segment is then assigned to a character by comparing its voice embedding to the exemplar embeddings (right).
Figure 3: Stage 2 Precision-POCS Curves for the test set of the three TV series, obtained by varying the threshold $d$ (for classification as "unknown"). The left figure shows the performance using all detected speech segments. The right figure shows the performance only for the long segments ($> 2$ sec). We also show the oracle points ("x" in each graph) for each TV series. The oracle point is where all segments for which there are character exemplars are correctly classified, and other segments are classified as "unknown".
Figure 4: Qualitative example. Our method produces the speech segments with timestamps, and assigns the character who spoke it.
