First Frame Is the Place to Go for Video Content Customization
Jingxi Chen, Zongxia Li, Zhichao Liu, Guangyao Shi, Xiyang Wu, Fuxiao Liu, Cornelia Fermuller, Brandon Y. Feng, Yiannis Aloimonos
TL;DR
This work reframes the first video frame as a conceptual memory buffer that allows multi-reference content customization without architectural changes. It introduces FFGo, a lightweight pipeline that combines Vision-Language Model–driven data curation with few-shot LoRA adaptation (20–50 examples) to activate the pre-trained model's innate ability to fuse multiple references. The approach achieves robust, generalizable, reference-based video customization across diverse applications while preserving pre-trained priors, as demonstrated by both qualitative assessments and a user study. In practice, this yields a scalable, data-efficient method for customized video generation suited to filmmaking, robotics, and simulation tasks.
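For readers unfamiliar with LoRA, the following minimal sketch shows why this kind of adaptation can leave pre-trained priors intact: the base weight W stays frozen and only a low-rank update (alpha / r) · B · A is learned. This is the generic LoRA formulation written in plain Python, not the FFGo implementation; the function names, shapes, and values of `alpha` and `r` are illustrative assumptions, and the summary does not state which layers are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_adapter(d_out, d_in, r, seed=0):
    """Standard LoRA init: A is small random noise, B is zero.

    With B = 0 the update B @ A is zero, so the adapted layer initially
    behaves exactly like the frozen pre-trained layer.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)                # A has shape (r, d_in)
    delta = matmul(b, a)      # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_fraction(d_out, d_in, r):
    """Fraction of a layer's parameters that the adapter trains."""
    return r * (d_in + d_out) / (d_out * d_in)
```

Under these assumptions, a rank-16 adapter on a 1024 × 1024 layer trains about 3% of that layer's parameters, which is one reason a few dozen examples can suffice without overwriting what the base model already knows.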
Abstract
What role does the first frame play in video generation models? Traditionally, it's viewed as the spatial-temporal starting point of a video, merely a seed for subsequent animation. In this work, we reveal a fundamentally different perspective: video models implicitly treat the first frame as a conceptual memory buffer that stores visual entities for later reuse during generation. Leveraging this insight, we show that it's possible to achieve robust and generalized video content customization in diverse scenarios, using only 20–50 training examples without architectural changes or large-scale finetuning. This unveils a powerful, overlooked capability of video generation models for reference-based video customization.
