Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
Bruno Korbar, Jaesung Huh, Andrew Zisserman
TL;DR
This work defines and tackles character-aware subtitling, an automatic task similar to SDH (subtitles for the deaf and hard of hearing) that identifies who spoke what and when, using an audio-visual pipeline that does not rely on face detection or tracking. It introduces a two-stage approach: Stage 1 builds high-precision per-character speech exemplars from audio-visual cues, and Stage 2 assigns every speech segment to a character identity via centroid-based embedding similarity, with an explicit "unknown" option controlled by a distance threshold. The authors contribute a dataset built from three sitcoms with ground-truth speaker identities and timestamps, and report competitive performance together with a detailed analysis of Stage 1 and Stage 2, including transcription quality via WhisperX. The work advances practical subtitle generation for accessibility and supports large-scale data creation for video-language research, while acknowledging limitations on short utterances and overlapping speech. Overall, the method offers a scalable, face-detection-free way to generate character-aware subtitles with precise timing.
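The Stage 2 assignment step can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance metric, the threshold value of 0.5, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration. The summary above states only that segments are matched to per-character centroids and that a distance threshold controls the unknown option.

```python
import numpy as np


def l2_normalise(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_centroids(exemplars):
    """Average each character's exemplar embeddings into one unit-length centroid.

    exemplars: dict mapping a character name to an (n, d) array-like of
    speaker embeddings (hypothetical Stage 1 output).
    """
    return {
        name: l2_normalise(l2_normalise(embs).mean(axis=0))
        for name, embs in exemplars.items()
    }


def assign_speaker(segment_emb, centroids, max_distance=0.5):
    """Return the nearest character, or "unknown" if no centroid is close enough.

    Uses cosine distance (1 - cosine similarity); both the metric and the
    default max_distance are assumptions, not values from the paper.
    """
    seg = l2_normalise(segment_emb)
    names = list(centroids)
    dists = np.array([1.0 - float(seg @ centroids[n]) for n in names])
    best = int(dists.argmin())
    return names[best] if dists[best] <= max_distance else "unknown"


# Toy 2-D embeddings; real speaker embeddings have far more dimensions.
exemplars = {
    "jerry": [[1.0, 0.0], [0.9, 0.1]],
    "elaine": [[0.0, 1.0], [0.1, 0.9]],
}
centroids = build_centroids(exemplars)
labels = [
    assign_speaker([1.0, 0.05], centroids),
    assign_speaker([0.05, 1.0], centroids),
    assign_speaker([-1.0, -1.0], centroids),
]
```

In this sketch, lowering `max_distance` sends more segments to the unknown label, trading recall for precision; how the threshold should be chosen in practice is not specified here.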
Abstract
The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps and the speaking character identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast number of videos available on modern streaming services. Project page: https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/
