UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee

TL;DR

UniTalk introduces a large-scale, real-world ASD benchmark that emphasizes language diversity, background noise, and crowded scenes to address the limitations of AVA. By combining a rigorous data-curation pipeline, dense frame-level annotations, and diagnostic evaluation subsets, UniTalk reveals substantial headroom for improvement in current ASD models. Experiments show that state-of-the-art models trained on traditional benchmarks underperform on UniTalk, yet UniTalk-trained models generalize better across other real-world datasets and serve as a strong pretraining source for AVA and related tasks. The work advances ASD research by providing a more representative evaluation framework and practical pretraining data for robust, cross-domain speaker detection.

Abstract

We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk demonstrate stronger generalization to modern "in-the-wild" datasets like Talkies and ASW, as well as to AVA. UniTalk thus establishes a new benchmark for active speaker detection, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.

Dataset: https://huggingface.co/datasets/plnguyen2908/UniTalk-ASD
Code: https://github.com/plnguyen2908/UniTalk-ASD-code

Paper Structure

This paper contains 22 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Comparison between AVA and UniTalk. AVA [roth2020ava] primarily consists of movie content, often with clean audio and simple visual composition. It also includes dubbed videos, where the audio is artificially overlaid and may not align with visible speech, potentially limiting the reliability of audiovisual supervision. In contrast, UniTalk features diverse real-world scenarios, including crowded scenes, underrepresented languages, noisy backgrounds, and combinations thereof. Each row shows a representative clip from a subcategory in UniTalk, with icons indicating language, noise level, and visual complexity.
  • Figure 2: Data curation pipeline. Our data curation pipeline consists of four distinct stages: (1) video sourcing to construct an initial pool of candidate clips, (2) content filtering to remove videos containing sensitive or inappropriate material, (3) face track generation to convert raw videos into structured face sequences, and (4) annotation and storage for benchmark use.
  • Figure 3: Language distribution in UniTalk vs. AVA. UniTalk covers a wider range of languages, with particularly strong representation of East Asian languages (e.g., Chinese, Korean, and Japanese). In contrast, AVA primarily consists of Indo-European languages, limiting its linguistic diversity.
  • Figure 4: Dataset Composition Overview. (a) Race distribution of visible speakers. (b) Number of visible faces per frame, reflecting the range of visual complexity. (c) Breakdown of test set according to targeted difficulty categories used for evaluation.
  • Figure A: Difficulty space of candidate video search terms. Each point represents a YouTube keyword query, plotted by the average number of faces per frame (x-axis, visual complexity) and average background noise level (y-axis, measured via RMS after VAD). We highlight three shaded regions corresponding to different axes of difficulty: crowded scenes (high visual complexity, bottom right), noisy backgrounds (high acoustic complexity, top left), and hard examples (both high visual and acoustic complexity, top right).
  • ...and 3 more figures
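
The Figure A caption describes placing each search query in a two-axis difficulty space: average faces per frame and background noise measured via RMS after VAD. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that "RMS after VAD" means averaging frame-level RMS over the frames a voice activity detector marked as non-speech. The function names, the `speech_mask` input, and both thresholds are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS of each non-overlapping frame; a trailing partial frame is dropped."""
    n_frames = len(samples) // frame_len
    out = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return out


def background_rms(samples: Sequence[float],
                   speech_mask: Sequence[bool],
                   frame_len: int) -> float:
    """Mean RMS over frames marked as non-speech.

    speech_mask[i] is True when frame i contains speech. It is assumed to
    come from any external VAD; no specific VAD is implied here.
    """
    rms = frame_rms(samples, frame_len)
    if len(rms) != len(speech_mask):
        raise ValueError("speech_mask must have one entry per full frame")
    noise = [r for r, is_speech in zip(rms, speech_mask) if not is_speech]
    return sum(noise) / len(noise) if noise else 0.0


def difficulty_region(avg_faces: float, noise_rms: float,
                      face_thresh: float = 3.0,      # hypothetical cutoff
                      noise_thresh: float = 0.05     # hypothetical cutoff
                      ) -> str:
    """Map a query to one of the three shaded regions, or 'other'."""
    crowded = avg_faces >= face_thresh
    noisy = noise_rms >= noise_thresh
    if crowded and noisy:
        return "hard"
    if crowded:
        return "crowded"
    if noisy:
        return "noisy"
    return "other"


# Toy clip: quiet background around one louder speech frame.
clip = [0.1] * 4 + [0.5] * 4 + [0.1] * 4
mask = [False, True, False]
noise_level = background_rms(clip, mask, frame_len=4)
region = difficulty_region(avg_faces=4.0, noise_rms=noise_level)
```

In the toy clip the speech frame is excluded, so only the two quiet frames contribute to the noise estimate. In practice the thresholds would have to be tuned against the actual distribution of queries.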