
A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks

Babak Naderi, Ross Cutler

Abstract

Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15\,s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4\%) or MJPEG-encoded (75.6\%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4\% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to $-71.3\%$ (H.266) relative to H.264, with significant encoder$\times$dataset ($\eta_p^2 = .112$) and encoder$\times$content condition ($\eta_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency. The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs.\ 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
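The abstract reports codec efficiency as BD-rate savings relative to H.264. The paper's evaluation scripts are not shown here; the sketch below is a minimal, standard implementation of the Bjøntegaard delta-rate metric (cubic fit of log-bitrate vs. quality, integrated over the overlapping quality range), which is the conventional way such numbers are computed. Function name and inputs are illustrative, not taken from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, qual_anchor, rate_test, qual_test):
    """Bjontegaard delta rate: average bitrate difference (%) of the test
    codec relative to the anchor at equal quality (e.g. VMAF or PSNR).
    Negative values mean the test codec saves bitrate."""
    log_r_anchor = np.log10(rate_anchor)
    log_r_test = np.log10(rate_test)

    # Fit cubic polynomials: log-bitrate as a function of quality.
    p_anchor = np.polyfit(qual_anchor, log_r_anchor, 3)
    p_test = np.polyfit(qual_test, log_r_test, 3)

    # Integrate both fits over the overlapping quality interval.
    lo = max(min(qual_anchor), min(qual_test))
    hi = min(max(qual_anchor), max(qual_test))
    int_anchor = np.polyval(np.polyint(p_anchor), hi) - np.polyval(np.polyint(p_anchor), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

    # Average log-bitrate difference, converted back to a percentage.
    avg_diff = (int_test - int_anchor) / (hi - lo)
    return (10 ** avg_diff - 1) * 100
```

As a sanity check, a test codec that needs exactly half the anchor's bitrate at every quality point yields a BD-rate of $-50\%$.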


Paper Structure

This paper contains 21 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Cumulative distribution function of MOS values across the 847 published clips.
  • Figure 2: Thumbnail atlases of the three benchmarking groups. (a) Talking Head, (b) Talking Head - Background Blur, (c) Talking Head - Background Replacement
  • Figure 3: Distribution of clips in the SI--TI space for each benchmarking group, color-coded by MOS category (green: High, blue: Medium, red: Low). Dashed lines indicate population medians.
  • Figure 4: Rate--distortion curves (PSNR and VMAF) for the NR-TH benchmarking subset, and VMAF rate--distortion curves for the VCD and HEVC datasets. Shaded bands indicate 95% confidence intervals.
  • Figure 5: Rate--distortion curves per benchmarking group. Top row: PSNR vs. bpp; bottom row: VMAF vs. bpp. Each curve shows the mean metric value across clips in the group.