Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

Ernie Chu; Vishal M. Patel

Abstract

Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from the host's audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior. Dataset and code will be made publicly available.

Paper Structure (23 sections, 1 equation, 7 figures, 2 tables)

Introduction
Related Work
Dataset Construction Pipeline
(a) Initial Vetting.
(b) Host Localization from Audio.
(c) Host Face Modeling.
(d) Identity Tracking and Assignment.
Dataset Outcome.
Implementation Details.
Reactive Digital Avatar Generation Dataset
Clip filtering.
Canonical crops.
Synchronized renders and metadata.
Reactive baseline.
Digital Avatar Generator
...and 8 more sections

Figures (7)

Figure 1: Face-to-Face with Jimmy Fallon (F2F-JF) in one look. (a) Raw talk-show frames show the variety of topics, lighting, and guest identities captured with aligned audio. (b) Cropped faces emphasize demographic diversity and the fact that every guest is paired with the same recurring host. (c) Each host--guest turn is trimmed into a guest-context/host-response pair, which becomes the supervised signal for reactive avatar generation. All panels are sampled from different videos with synchronized speech.
Figure 2: Face-to-Face data pipeline. (a) Multi-person tracking slices long episodes into segments that contain exactly two visible people. (b) Speaker diarization finds the recurring host in the audio channel. (c) Vetted host face crops form an embedding reference for visual verification. (d) Frame-level tracking labels the host and guest, producing clean dyadic clips with aligned identities.
Figure 3: Histogram of two-person clip lengths after filtering. Most clips fall between 8 and 20 seconds, which ensures enough temporal context for both the guest turn and the host reaction.
Figure 4: Sample frames from the Face-to-Face with Jimmy Fallon (F2F-JF) dataset. Each row shows the same host interacting with different guests, highlighting the range of poses, clothing, and conversational topics covered by the dataset.
Figure 5: Paired crops for the reactive avatar task. Each triplet shows (left) the guest providing visual context, (middle) the temporal boundary between turns, and (right) the host response generated from the same clip. The crops stay aligned in time and framing, which allows the model to condition on the guest video before synthesizing the host.
...and 2 more figures
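
The guest-context/host-response pairing described in the captions of Figure 1(c) and Figure 5 can be illustrated with a minimal sketch. It assumes each turn is given by three timestamps in seconds, $t_0 < t_1 < t_2$, and maps them to frame index ranges. The names `ReactivePair` and `split_turn` and the default frame rate of 25 fps are hypothetical choices for illustration and are not taken from the paper or its code release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactivePair:
    """Frame index ranges (end-exclusive) for one guest-context/host-response pair."""
    guest_context: range  # guest video frames covering [t0, t1)
    host_response: range  # host video frames covering [t1, t2)


def split_turn(t0: float, t1: float, t2: float, fps: float = 25.0) -> ReactivePair:
    """Map turn boundaries in seconds to frame ranges.

    fps=25.0 is an assumed frame rate, not a value reported by the paper.
    """
    if not (t0 < t1 < t2):
        raise ValueError("expected t0 < t1 < t2")
    # Convert each boundary to the nearest frame index.
    f0, f1, f2 = (round(t * fps) for t in (t0, t1, t2))
    return ReactivePair(range(f0, f1), range(f1, f2))


pair = split_turn(t0=2.0, t1=6.0, t2=10.0)
print(len(pair.guest_context), len(pair.host_response))  # 100 100
```

Under this reading, the guest frames in `guest_context` would serve as the visual conditioning signal, and the host frames in `host_response`, together with the host's audio over the same interval, would be the generation target.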
