First Frame Is the Place to Go for Video Content Customization

Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos

TL;DR

This work reframes the first video frame as a conceptual memory buffer that allows multi-reference content customization without architectural changes. It introduces FFGo, a lightweight pipeline that combines Vision-Language Model–driven data curation with few-shot LoRA adaptation (20–50 examples) to invoke the model's innate fusion capabilities. The approach achieves robust, generalizable, reference-based video customization across diverse applications while preserving pre-trained priors, as demonstrated by both qualitative assessments and a user study. The practical impact is a scalable, data-efficient method for customized video generation suitable for filmmaking, robotics, and simulation tasks.

Abstract

What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20-50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.

Paper Structure

This paper contains 38 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: In this figure, we illustrate a general yet under-explored observation: video generation models possess an innate ability to perform subject mixing via scene transitions from a mixed-subject first frame. As shown, the red-boxed results (without the transition phrase $<$transition$>$) contrast with the blue-boxed results (with a carefully chosen $<$transition$>$, e.g., “The camera view suddenly zoom in to show”), revealing significant differences in composition. However, this phenomenon faces three key limitations that hinder practical use: 1) The prompt engineering process for $<$transition$>$ is highly manual, time-consuming, and model/video-dependent. 2) Scene transitions are often unstable. 3) Object identity is often lost, resulting in changes in appearance or the disappearance of reference objects.
  • Figure 2: Overview of our proposed pipeline, FFGo, which consists of 1) Dataset Curation, which extracts high-quality finetuning data from existing videos, and 2) Few-shot LoRA Adaptation, used for training and inference to invoke the I2V model's innate ability to fuse the subjects in the first frame and perform a scene transition, generating a video $V_{mix}$ that follows the subjects in the mixed first frame $I_{mix}$ and the text prompt.
  • Figure 3: As shown in the figure, in rare cases where the base model Wan2.2-I2V-A14B successfully performs a scene transition while preserving all reference object identities, the output closely resembles ours. This demonstrates that our add-on approach effectively retains the base model’s pre-trained generative capabilities.
  • Figure 4: Qualitative comparison with baseline methods. This test scenario involves generalized multi-object interactions. As shown in the figure, our method best preserves the identities of input objects and the scene, while generating a customized video with coherent motion that aligns with the text prompt description.
  • Figure 5: Qualitative comparison with baseline methods. This scenario evaluates performance with an excessive number of references, five in total (four objects and one scene). VACE and SkyReels-A2, due to their architecture-based limitations, support only up to three references and fail to include all four reference objects in the generated video. In contrast, our model successfully fuses all four objects into a coherent, customized video with natural interactions. Notably, our model also enables precise selection via text prompt (e.g., blue iPhone), preserving key visual traits such as the triple-camera design while modifying appearance (e.g., changing the color to blue).
  • ...and 8 more figures