FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu

TL;DR

The paper proposes a face-tailored, scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors, and introduces FaceCam, a system that generates video under customizable camera trajectories from monocular human portrait video input.

Abstract

We introduce FaceCam, a system that generates video under customizable camera trajectories from monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.

Paper Structure (28 sections, 16 equations, 11 figures, 4 tables, 3 algorithms)

Figures (11)

  • Figure 1: FaceCam generates portrait videos with precise camera control from a single input video and a target camera trajectory. We introduce scale-aware camera conditioning that represents the target camera via rendered facial landmarks, enabling accurate camera pose control. Our approach preserves subject identity and motion while maintaining high visual quality. Project page: https://weijielyu.github.io/FaceCam.
  • Figure 2: Camera representation comparison. We contrast (A) parameter-based representations, which are standard in camera control methods, with (B) image-space point correspondences, which we adopt in FaceCam to obtain a scale-aware conditioning that enables precise camera control.
  • Figure 3: Training and inference pipeline of FaceCam.
  • Figure 4: Training data generation examples. Scale and color augmentation are applied to the source video to increase data diversity, while the target video receives all three augmentation types to train the model’s camera control capability.
  • Figure 5: Qualitative results on Ava-256. FaceCam produces more realistic, ground-truth-aligned novel views than baselines. ReCamMaster often fails under large pose changes, pushing the head out of frame, while TrajectoryCrafter frequently shows facial distortions from dynamic point-cloud errors.
  • ...and 6 more figures
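
The captions for Figures 1 and 2 contrast parameter-based camera representations with image-space point correspondences obtained from rendered facial landmarks. The sketch below is only an illustration of that contrast under a plain pinhole model, not the authors' implementation: the landmark coordinates, the focal length, and the helper names `project` and `rotation_y` are all hypothetical. It shows that camera parameters alone leave scene scale undetermined, while the projected landmark positions fix the image-space result.

```python
import numpy as np

def project(points_3d, R, t, focal=1.0):
    """Pinhole projection of Nx3 world points to normalized image coordinates."""
    cam = points_3d @ R.T + t              # world frame -> camera frame
    return focal * cam[:, :2] / cam[:, 2:3]

def rotation_y(deg):
    """Rotation matrix for a rotation of `deg` degrees about the y axis."""
    a = np.deg2rad(deg)
    return np.array([
        [np.cos(a), 0.0, np.sin(a)],
        [0.0, 1.0, 0.0],
        [-np.sin(a), 0.0, np.cos(a)],
    ])

# Hypothetical 3D facial landmarks (arbitrary units), assumed for illustration.
landmarks = np.array([
    [-0.03, 0.03, 0.00],   # left eye
    [0.03, 0.03, 0.00],    # right eye
    [0.00, 0.00, 0.02],    # nose tip
    [-0.02, -0.03, 0.00],  # left mouth corner
    [0.02, -0.03, 0.00],   # right mouth corner
])

R = rotation_y(20.0)
t = np.array([0.0, 0.0, 0.5])
uv = project(landmarks, R, t)              # 2D landmark conditioning signal

# Scale ambiguity: scaling the scene and the translation together
# produces exactly the same image.
s = 3.0
uv_scaled = project(s * landmarks, R, s * t)
same_image = np.allclose(uv, uv_scaled)

# The same (R, t) applied to a scene of a different scale produces a
# different image, so (R, t) alone does not determine the rendered view.
uv_mismatch = project(s * landmarks, R, t)
differs = not np.allclose(uv, uv_mismatch)

print(same_image, differs)
```

In this toy setup the 2D landmark array `uv` plays the role of a deterministic, image-space description of the target view, whereas the pair `(R, t)` is only meaningful once a scene scale is also known.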