SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun

TL;DR

SUGAR tackles zero-shot subject-driven video customization from a single image, generating videos that reflect arbitrary user-described attributes without test-time fine-tuning. It introduces a transformer-based diffusion framework that operates in a latent space, trained on a large synthetic dataset of 2.5M image-video-text triplets and improved with targeted training strategies, selective attention, and a refined sampling algorithm. The approach combines real-world motion data with synthetic examples to improve both identity preservation and style/motion alignment, achieving state-of-the-art results in identity preservation, text alignment, dynamics, and consistency. Extensive ablations examine the contribution of the dataset, the chosen attention design, and the dual-conditioning strategy.

Abstract

We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR generates videos of the subject contained in the image and aligns the generation with arbitrary visual attributes, such as style and motion, specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline for constructing a synthetic dataset designed specifically for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. In extensive experiments, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization compared to previous methods, demonstrating the effectiveness of the proposed method.

Paper Structure

This paper contains 32 sections, 4 equations, 21 figures, 1 table.

Figures (21)

  • Figure 1: Generated examples from the proposed method; the frames are randomly sampled from each generated video. The method generates videos of the specific subject contained in a user-input image in a zero-shot manner. The generated video also meets the requirements described by the user-input text.
  • Figure 2: Illustration of our model. Randomly sampled frames from the generated video are shown for clarity.
  • Figure 3: The proposed pipeline for synthetic data generation.
  • Figure 4: Different attention designs of our proposed model. One embedding can attend to another only when the corresponding position is shaded in the illustration. For instance, in design (b) image embeddings cannot attend to the first frame, but the first frame can attend to the image embeddings.
  • Figure 5: Comparison of models with different attention designs.
  • ...and 16 more figures
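
The asymmetric rule in the Figure 4 caption for design (b) can be sketched as a boolean attention mask. This is a minimal pure-Python illustration, not the authors' implementation. The names `build_mask` and `masked_softmax`, the token counts, the token ordering (image, first frame, remaining frames), and the choice to allow every pair the caption does not mention are all assumptions made for the example.

```python
import math

def build_mask(n_img, n_first, n_rest):
    """Return mask[q][k]: True if query token q may attend to key token k.

    Assumed token order: image tokens, first-frame tokens, remaining tokens.
    Only the rule stated in the caption is encoded; all other pairs are
    allowed here purely for illustration.
    """
    n = n_img + n_first + n_rest
    mask = [[True] * n for _ in range(n)]
    for q in range(n_img):                          # image-embedding queries
        for k in range(n_img, n_img + n_first):     # first-frame keys
            mask[q][k] = False                      # image cannot see first frame
    return mask

def masked_softmax(scores, mask_row):
    """Softmax over one query's scores, giving zero weight to masked keys."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    top = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: 2 image tokens, 2 first-frame tokens, 2 other tokens.
mask = build_mask(n_img=2, n_first=2, n_rest=2)

# Uniform scores make the effect of the mask easy to read off.
weights = masked_softmax([0.0] * 6, mask[0])
print(weights)
```

With uniform scores, the first image token spreads its attention evenly over the four keys it is allowed to see and gives zero weight to the two first-frame keys, while a first-frame query row (`mask[2]`) remains fully open, including toward the image tokens.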