ActAnywhere: Subject-Aware Video Background Generation

Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang

TL;DR

ActAnywhere tackles automated subject-aware video background generation by conditioning a latent diffusion model on a foreground subject sequence and a single background frame. It encodes foreground masks and frames into latents and uses cross-frame attention with CLIP-conditioned background guidance to synthesize temporally coherent videos that follow the subject’s motion while matching the condition frame. Trained on the 2.4M-clip HiC+ human-scene video dataset, the approach delivers realistic foreground-background interactions, dynamic lighting, and shadows, and generalizes to non-human subjects; it also achieves practical generation speed (~8.5 seconds per video) for rapid ideation. This work advances visual effects workflows by automating the generation of diverse, scene-consistent video backgrounds.
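As a rough illustration of the conditioning described above, the sketch below tracks only tensor shapes for the input a denoising network might receive for one clip. The channel-wise concatenation of noisy latents, foreground latents, and a downsampled mask is an assumption made for illustration, as are the specific sizes (4 latent channels, 8x spatial downsampling); this summary does not state them, and the names `ClipShape`, `to_latent_shape`, and `denoiser_input_shape` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipShape:
    """Shape of a video tensor laid out as (frames, channels, height, width)."""
    frames: int
    channels: int
    height: int
    width: int


def to_latent_shape(clip: ClipShape, latent_channels: int = 4, downsample: int = 8) -> ClipShape:
    """Shape after a hypothetical per-frame encoder (assumed 8x downsampling)."""
    if clip.height % downsample or clip.width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return ClipShape(clip.frames, latent_channels,
                     clip.height // downsample, clip.width // downsample)


def denoiser_input_shape(foreground: ClipShape, mask: ClipShape,
                         latent_channels: int = 4, downsample: int = 8) -> ClipShape:
    """Shape of the assumed denoiser input: noisy latent + foreground latent + mask."""
    same_extent = (mask.frames, mask.height, mask.width) == (
        foreground.frames, foreground.height, foreground.width)
    if not same_extent or mask.channels != 1:
        raise ValueError("mask must be single-channel and match the foreground clip")
    fg_latent = to_latent_shape(foreground, latent_channels, downsample)
    noisy_channels = fg_latent.channels  # noisy latent has the same shape as the foreground latent
    mask_channels = 1                    # mask resized to latent resolution, kept as one channel
    total = noisy_channels + fg_latent.channels + mask_channels
    return ClipShape(fg_latent.frames, total, fg_latent.height, fg_latent.width)


if __name__ == "__main__":
    fg = ClipShape(frames=16, channels=3, height=256, width=384)
    m = ClipShape(frames=16, channels=1, height=256, width=384)
    print(denoiser_input_shape(fg, m))
```

Under these assumed sizes, a 16-frame 256x384 clip yields a 9-channel input at 32x48 latent resolution. The background condition frame would enter separately (for example through CLIP features in attention layers), which this shape-only sketch does not model.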

Abstract

Generating video backgrounds tailored to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image that describes the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions that adheres to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations show that our model significantly outperforms baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.

Paper Structure (17 sections, 3 equations, 10 figures)

Figures (10)

  • Figure 1: Given a sequence of foreground segmentations as input, and one frame that describes the background as the condition, ActAnywhere generates a coherent video background that adapts to the subject motion. We show two subjects here, each with two generated samples. ActAnywhere is able to generate videos consistent with the condition frame with highly realistic details such as splashing water, moving smoke and flame, shadows, duck feet, etc. It generalizes to a diverse distribution of subjects and backgrounds, including non-human subjects. Our method works with both composited frames and background-only images as the condition.
  • Figure 2: Architecture overview. During training, we take a randomly sampled frame from the training video to condition the denoising process. At test time, the condition can be either a composited frame of the subject with a novel background, or a background-only image.
  • Figure 3: Additional results with our method. The top part shows examples using inpainted frames as the condition, while the bottom part contains examples with background-only conditioning. Foreground sequences are from the held-out set of HiC+.
  • Figure 4: Comparison with baselines. We provide results on two videos sampled from the DAVIS dataset. For each example, we show three representative frames (top) and their corresponding condition signal (left). Note that different methods assume different input, conditioning, or pre-trained models, as specified in the comparison subsection.
  • Figure 5: Zero-shot video inpainting with our model. We show two cases from DAVIS, each with four sampled frames. The yellow regions denote the masked areas to be inpainted.
  • ...and 5 more figures
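
The two conditioning modes in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: frames are plain nested lists of grayscale values in [0, 1], the helper names `sample_condition_index` and `composite` are hypothetical, and simple mask-based blending stands in for however a composited condition frame is actually prepared (Figure 3 mentions inpainted frames).

```python
import random
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel values, for brevity


def sample_condition_index(num_frames: int, rng: random.Random) -> int:
    """Training-time choice: pick a random frame of the clip as the condition."""
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    return rng.randrange(num_frames)


def composite(foreground: Frame, mask: Frame, background: Frame) -> Frame:
    """Test-time option: paste the masked subject onto a novel background.

    Each output pixel is mask * foreground + (1 - mask) * background.
    """
    if not (len(foreground) == len(mask) == len(background)):
        raise ValueError("all inputs must have the same number of rows")
    out = []
    for f_row, m_row, b_row in zip(foreground, mask, background):
        if not (len(f_row) == len(m_row) == len(b_row)):
            raise ValueError("all inputs must have the same row length")
        out.append([m * f + (1.0 - m) * b for f, m, b in zip(f_row, m_row, b_row)])
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print(sample_condition_index(16, rng))
    print(composite([[1.0, 1.0]], [[1.0, 0.0]], [[0.5, 0.25]]))
```

For the background-only mode, the background image would be passed as the condition directly, with no compositing step.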