Character-aware audio-visual subtitling in context

Jaesung Huh, Andrew Zisserman

TL;DR

An improved framework for character-aware audio-visual subtitling in TV shows. It addresses a limitation of existing methods, which cannot accurately assign speakers to short temporal segments, by showing that the speaker of a short segment can be determined from the temporal context of the dialogue within a scene.

Abstract

This paper presents an improved framework for character-aware audio-visual subtitling in TV shows. Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues. This holistic solution addresses what is said, when it is said, and who is speaking, providing more comprehensive and accurate character-aware subtitling for TV shows. Our approach brings improvements on two fronts. First, we show that audio-visual synchronisation can be used to pick out the talking face amongst others present in a video clip, and to assign an identity to the corresponding speech segment. This audio-visual approach improves recognition accuracy and yield over current methods. Second, we show that the speaker of short segments can be determined by using the temporal context of the dialogue within a scene. We propose an approach using local voice embeddings of the audio, and large language model reasoning on the text transcription. This overcomes a limitation of existing methods, which are unable to accurately assign speakers to short temporal segments. We validate the method on a dataset of 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches. Project page: https://www.robots.ox.ac.uk/~vgg/research/llr-context/

Paper Structure

This paper contains 38 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: An example video clip and output of our method. Dialogues in TV shows typically flow continuously, and speaker identities can often be inferred from the content and context of the conversation. In some cases, it is possible to diarise speakers solely based on textual context. Even though the speaker is not visible, so there is no evidence from lip movement, we can infer that the utterance marked with a question mark (?) belongs to 'Niles' by looking at the temporal context of the dialogue.
  • Figure 2: Assigning speakers to short audio segments. First, we use speaker embeddings from nearby segments where we have high confidence in speaker identification (left). Second, we employ a Large Language Model (LLM) to determine the speaker based on the content of the conversation (right).
  • Figure 3: The visual prediction process for a speech segment. Visible speakers with lip movements synchronised with the speech audio are recognised by using a visual embedding from the castlist. This assigns an identity to the corresponding speaker.
  • Figure 4: A schematic overview of our pipeline. We first extract the audio exemplars from videos (top) and use them to label all audio segments (bottom).
  • Figure 5: Distribution of segment lengths on LLR-TV and Bazinga!-gold-TV.
  • ...and 2 more figures
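
The first step described in the Figure 2 caption, labelling a short segment from the voice embeddings of nearby high-confidence segments, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the `threshold` value are all assumptions made for illustration, and the embeddings are toy two-dimensional vectors rather than outputs of a real speaker model.

```python
import math
from typing import List, Optional, Sequence, Tuple

Embedding = Sequence[float]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_short_segment(
    query: Embedding,
    neighbours: List[Tuple[str, Embedding]],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Optional[str]:
    """Return the speaker of the most similar nearby segment.

    `neighbours` holds (speaker name, voice embedding) pairs for nearby
    segments whose speaker is already known with high confidence.
    Returns None when no neighbour reaches `threshold`.
    """
    best_name: Optional[str] = None
    best_score = -1.0
    for name, embedding in neighbours:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Toy example: two confidently labelled neighbours and one short segment.
neighbours = [("Niles", [1.0, 0.0]), ("Frasier", [0.0, 1.0])]
label = assign_short_segment([0.9, 0.1], neighbours)
unresolved = assign_short_segment([-1.0, -1.0], neighbours)
```

In this sketch a `None` result marks a segment that the voice evidence cannot resolve. Under the same assumptions, such segments are the ones that could be handed to a text-based step like the LLM reasoning described in the caption.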