Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

Ernie Chu, Vishal M. Patel

Abstract

Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from the host's audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
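
To make the task's timing concrete, the following is a minimal sketch of how one guest-context/host-response sample could be indexed once the turn boundaries $t_0 < t_1 < t_2$ are known. The names `ReactivePair` and `make_reactive_pair` and the 25 fps default are illustrative assumptions, not part of the released dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactivePair:
    """Half-open frame ranges for one guest-context / host-response sample."""
    guest_context: range   # guest video frames covering [t0, t1)
    host_response: range   # host video frames covering [t1, t2)


def make_reactive_pair(t0: float, t1: float, t2: float, fps: float = 25.0) -> ReactivePair:
    """Convert turn boundaries (seconds) into frame index ranges.

    fps=25.0 is an assumed default, not a value taken from the paper.
    """
    if not (t0 < t1 < t2):
        raise ValueError("expected t0 < t1 < t2")
    # Round each boundary to the nearest frame so the two ranges share
    # the boundary frame index and never overlap.
    f0, f1, f2 = (round(t * fps) for t in (t0, t1, t2))
    return ReactivePair(guest_context=range(f0, f1), host_response=range(f1, f2))


pair = make_reactive_pair(t0=2.0, t1=6.0, t2=10.0)
print(len(pair.guest_context), len(pair.host_response))  # 100 100
```

Under this indexing, a model would read the guest frames in `guest_context` as conditioning and be supervised on the host frames in `host_response`, with the host audio taken from the same $[t_1,t_2]$ window.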

Paper Structure (23 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Face-to-Face with Jimmy Fallon (F2F-JF) in one look. (a) Raw talk-show frames show the variety of topics, lighting, and guest identities captured with aligned audio. (b) Cropped faces emphasize demographic diversity and the fact that every guest is paired with the same recurring host. (c) Each host-guest turn is trimmed into a guest-context/host-response pair, which becomes the supervised signal for reactive avatar generation. All panels are sampled from different videos with synchronized speech.
  • Figure 2: Face-to-Face data pipeline. (a) Multi-person tracking slices long episodes into segments that contain exactly two visible people. (b) Speaker diarization finds the recurring host in the audio channel. (c) Vetted host face crops form an embedding reference for visual verification. (d) Frame-level tracking labels the host and guest, producing clean dyadic clips with aligned identities.
  • Figure 3: Histogram of two-person clip lengths after filtering. Most clips fall between 8 and 20 seconds, which ensures enough temporal context for both the guest turn and the host reaction.
  • Figure 4: Sample frames from the Face-to-Face with Jimmy Fallon (F2F-JF) dataset. Each row shows the same host interacting with different guests, highlighting the range of poses, clothing, and conversational topics covered by the dataset.
  • Figure 5: Paired crops for the reactive avatar task. Each triplet shows (left) the guest providing visual context, (middle) the temporal boundary between turns, and (right) the host response generated from the same clip. The crops stay aligned in time and framing, which allows the model to condition on the guest video before synthesizing the host.
  • ...and 2 more figures
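
The host/guest labeling step described in the Figure 2 caption (panels c and d) can be sketched as a nearest-reference check. This is only an illustration under stated assumptions: the function names, the use of cosine similarity, and the 0.5 threshold are hypothetical, and the face-embedding model that would produce the vectors is not specified here.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_tracks(
    track_embeddings: Dict[str, Sequence[float]],
    host_reference: Sequence[float],
    min_similarity: float = 0.5,  # assumed threshold, not from the paper
) -> Optional[Dict[str, str]]:
    """Label the two face tracks of a segment as "host" or "guest".

    Returns None when the segment does not contain exactly two tracks or
    when neither track is close enough to the host reference embedding.
    """
    if len(track_embeddings) != 2:
        return None
    sims = {tid: cosine(emb, host_reference) for tid, emb in track_embeddings.items()}
    host_id = max(sims, key=sims.get)
    if sims[host_id] < min_similarity:
        return None
    return {tid: ("host" if tid == host_id else "guest") for tid in sims}


labels = label_tracks({"track_a": [1.0, 0.0], "track_b": [0.0, 1.0]}, [0.9, 0.1])
print(labels)
```

Segments that return `None` would be natural candidates for the lightweight human verification the abstract mentions, rather than being labeled automatically.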