TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan

TL;DR

TalkCuts introduces a large-scale benchmark for long-form, multi-shot human speech video generation, featuring over 164k clips totaling more than 500 hours at 1080p with diverse camera shots and rich multimodal annotations (2D keypoints, 3D SMPL-X) across more than 10k identities. It also presents Orator, an end-to-end baseline in which DirectorLLM guides camera transitions, gestures, and vocal delivery, while SpeechGen and VideoGen produce synchronized audio and video from reference images. In experiments on LLM-guided shot transitions, audio-driven generation, and pose-guided generation, models trained on TalkCuts improve shot coherence, motion quality, and identity preservation over existing baselines. The results demonstrate the potential of combining large language models with multimodal generation to control long-form, dynamic videos, and establish TalkCuts as a foundational resource for future multi-shot speech video synthesis research.

Abstract

In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

Paper Structure

This paper contains 13 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the TalkCuts Dataset. The dataset features (1) diverse camera shot types (e.g., close-up, half-body, full-body), (2) annotations for 2D keypoints and 3D SMPL-X motion, and (3) a wide range of speaker identities spanning various ethnicities, body types, and age groups.
  • Figure 2: Multi-shot speech video generation. We propose Orator, a fully automated system that generates human speech videos with dynamic camera shots. By organically integrating multiple modules, a DirectorLLM directs camera transitions, gestures, and audio instructions, delivering coherent and engaging multi-shot speech videos.
  • Figure 3: Pipeline of Orator. The DirectorLLM processes the input script to generate instructions for camera shots, motion, and audio. These guide the multi-modal video generation model to produce the final long-form speech video with natural transitions and gestures.
  • Figure 4: Qualitative Comparison of Human Video Generation Results. We compare our results with baseline models across close-up, medium, and full-body shots. Artifacts in baseline outputs, such as facial distortions, motion blur, mismatched hand gestures, and lip-sync inconsistencies, are highlighted with arrows and bounding boxes. Our model produces more realistic results across all shot types, maintaining visual fidelity and smoother transitions compared to the baselines.
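
The data flow described in the Figure 3 caption (script in, per-shot instructions for camera, motion, and audio out, then audio and video generation) can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`ShotInstruction`, `direct`, `generate`) is hypothetical, the director is replaced by a simple rule that cycles through the three shot types named in the abstract, and the speech and video generators are replaced by string placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List

# Shot types named in the abstract; the cycling order is an assumption.
SHOT_TYPES = ("close-up", "half-body", "full-body")


@dataclass
class ShotInstruction:
    """Hypothetical per-shot plan a director model might emit."""
    text: str         # script segment spoken during this shot
    shot_type: str    # one of SHOT_TYPES
    motion_hint: str  # free-text gesture instruction
    audio_hint: str   # free-text vocal-delivery instruction


def direct(script: str) -> List[ShotInstruction]:
    """Rule-based stand-in for a director LLM.

    Splits the script on periods and assigns shot types round-robin.
    A real system would query a language model instead.
    """
    sentences = [s.strip() for s in script.split(".") if s.strip()]
    return [
        ShotInstruction(
            text=sentence,
            shot_type=SHOT_TYPES[i % len(SHOT_TYPES)],
            motion_hint="neutral gesture",
            audio_hint="neutral tone",
        )
        for i, sentence in enumerate(sentences)
    ]


def generate(script: str) -> List[Dict[str, str]]:
    """Run the toy pipeline: plan shots, then 'synthesize' audio and video.

    The audio and video values are placeholder strings, not real media.
    """
    clips = []
    for instr in direct(script):
        audio = f"audio<{instr.text}|{instr.audio_hint}>"
        video = f"video<{instr.shot_type}|{instr.motion_hint}|{audio}>"
        clips.append({"shot": instr.shot_type, "audio": audio, "video": video})
    return clips


if __name__ == "__main__":
    for clip in generate("Hello everyone. Today we talk about video. Thank you."):
        print(clip["shot"], "->", clip["video"])
```

The point of the sketch is only the ordering of stages: the plan is produced once for the whole script, and each shot's video placeholder is built from that shot's audio placeholder, mirroring the synchronized audio and video described in the TL;DR.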