SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li

TL;DR

SpeakerVid-5M is the first large-scale, high-quality dataset for audio-visual dyadic interactive virtual humans. It encompasses over 8.7K hours of video across 5.2M clips, with 1.8K hours of two-person dialogue, rich multi-modal annotations, and a dual-tier data design for pretraining and supervised fine-tuning. The paper also presents VidChatBench, a 500-pair benchmark for evaluating video quality, identity preservation, dialogue coherence, audio-visual synchronization, and emotional alignment, along with an autoregressive baseline that jointly generates audio and video conditioned on A/V inputs. Key contributions include the multi-branch dataset architecture (dialogue, single, listening, multi-turn), rigorous quality filtering, and detailed annotation pipelines (l_score, captions, ASR, scene, and speaker metadata). Empirical results show the dyadic setup yields superior coherence and quality, with ablations confirming the value of the spatial transformer and noise-injection strategy. The authors provide open-source data and tools to enable reproducibility and further research.

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data that serve as a benchmark, VidChatBench, for future work. Both the dataset and the corresponding data processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
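
The two organizing dimensions described in the abstract (interaction type and data quality) can be pictured as two labels attached to every clip. The following is a minimal sketch under stated assumptions, not the released data format: the names `ClipRecord`, `Branch`, `Tier`, `select_clips`, and `total_hours` are hypothetical, the toy records are made up for illustration, and treating the quality tier as a single label per clip is an assumption (the SFT subset may in practice be drawn from the pre-training pool).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Branch(Enum):
    """The four interaction types named in the abstract."""
    DIALOGUE = "dialogue"
    SINGLE = "single"
    LISTENING = "listening"
    MULTI_TURN = "multi_turn"


class Tier(Enum):
    """The two quality strata named in the abstract."""
    PRETRAIN = "pretrain"
    SFT = "sft"


@dataclass(frozen=True)
class ClipRecord:
    # Hypothetical per-clip index entry; field names are illustrative only.
    clip_id: str
    branch: Branch
    tier: Tier
    duration_s: float


def select_clips(
    clips: Iterable[ClipRecord],
    branch: Optional[Branch] = None,
    tier: Optional[Tier] = None,
) -> List[ClipRecord]:
    """Return clips matching the given branch and/or tier (None means 'any')."""
    return [
        c for c in clips
        if (branch is None or c.branch is branch)
        and (tier is None or c.tier is tier)
    ]


def total_hours(clips: Iterable[ClipRecord]) -> float:
    """Sum clip durations and convert seconds to hours."""
    return sum(c.duration_s for c in clips) / 3600.0


# Toy records, invented for this example only.
demo_clips = [
    ClipRecord("c1", Branch.SINGLE, Tier.PRETRAIN, 6.0),
    ClipRecord("c2", Branch.DIALOGUE, Tier.SFT, 12.0),
    ClipRecord("c3", Branch.DIALOGUE, Tier.PRETRAIN, 9.0),
    ClipRecord("c4", Branch.LISTENING, Tier.SFT, 9.0),
]

sft_dialogue = select_clips(demo_clips, branch=Branch.DIALOGUE, tier=Tier.SFT)
```

Keeping the two labels independent means a task picks its slice by combining them: for example, a dyadic chat model might pre-train on the full dialogue branch and fine-tune on only its SFT tier.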

Paper Structure

This paper contains 30 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of the audio-visual dyadic generation task and the SpeakerVid-5M dataset. The primary task (top row) is to generate a coherent audio-visual response based on the initiator's input. Our SpeakerVid-5M (bottom left) provides over 8.7K hours of data to facilitate this research. Each clip is enriched with detailed multi-modal annotations (right panel), enabling fine-grained generation.
  • Figure 2: The SpeakerVid-5M curation pipeline. The process consists of four stages: (1) source data collection from YouTube; (2) multi-step audio-visual pre-processing; (3) rich multi-modal annotation using models like Qwen-VL; (4) rigorous quality filtering for data fidelity.
  • Figure 3: Statistics of our dataset from multiple aspects, including blur score, sync score, caption, etc.
  • Figure 4: Examples of dyadic dialogue and body composition in SpeakerVid-5M. The top rows illustrate a typical dyadic human generation sample (initiator and responder). The bottom rows demonstrate the variety of body compositions annotated in our dataset, including close-up headshots, half-body, and full-body views, which are critical for controllable generation.
  • Figure 5: Our autoregressive audio-visual generation method.
  • ...and 5 more figures