OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille

TL;DR

OmniVCus tackles multi-subject subject-driven video customization under multimodal control by constructing training data from raw videos and transferring image-level edits to video. It introduces VideoCus-Factory for data generation, IVTM for image-to-video transfer, and OmniVCus, a diffusion Transformer with Lottery Embedding (LE) and Temporally Aligned Embedding (TAE) to scale subject usage and align temporal control signals. The approach enables flexible composition of depth, mask, camera, and text prompts to guide subject editing and movement in video, achieving state-of-the-art results across single- to quadruple-subject and camera-controlled tasks. This work provides a practical, scalable framework for controllable video synthesis with rich multimodal controls and offers public code and data for reproducibility.

Abstract

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios because multi-subject training data pairs are difficult to construct. How to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video also remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from unlabeled raw videos, along with control-signal pairs such as depth-to-video and mask-to-video. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training scheme that uses image editing data to enable instructive editing of the subject in the customized video. We then propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects than seen in training by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code, models, and data are released at https://github.com/caiyuanhao1998/Open-OmniVCus
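
The two embedding mechanisms can be illustrated with a minimal sketch of how frame-embedding indices might be assigned. This is not the released implementation: the function names, the pool size, the sorting step, and the offset that places video frames after the subject slots are all assumptions made for illustration. The abstract only states that LE activates more frame embeddings with the training subjects and that TAE gives control and noise tokens the same frame embeddings.

```python
import random


def lottery_frame_indices(num_subjects, pool_size, rng):
    """LE-style assignment (hypothetical helper).

    Assumption: each training sample draws distinct frame-embedding slots
    for its subject images from a pool larger than the number of training
    subjects, so that every slot in the pool gets trained over time.
    """
    if num_subjects > pool_size:
        raise ValueError("pool_size must be at least num_subjects")
    # Sample without replacement, then sort to keep a stable subject order.
    return sorted(rng.sample(range(pool_size), num_subjects))


def aligned_frame_indices(num_frames, offset):
    """TAE-style assignment (hypothetical helper).

    The control tokens of frame t reuse the frame-embedding index of the
    noise tokens of frame t. Assumption: video frames are indexed after
    the subject pool, starting at `offset`.
    """
    noise = [offset + t for t in range(num_frames)]
    control = list(noise)  # identical indices for control and noise tokens
    return noise, control


if __name__ == "__main__":
    rng = random.Random(0)
    pool_size = 4  # assumed pool size, larger than the 2 training subjects
    subject_slots = lottery_frame_indices(2, pool_size, rng)
    noise_idx, control_idx = aligned_frame_indices(3, offset=pool_size)
    print("subject slots:", subject_slots)
    print("noise frame indices:", noise_idx)
    print("control frame indices:", control_idx)
```

Under these assumptions, a model trained with two subjects per sample still sees every slot in the pool during training, which is one way to read the claim that more subjects can be composed at inference.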

Paper Structure

This paper contains 10 sections, 3 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: (a) and (c) show that our method can change the pose and action of the subject. In (b), the instructive editing texts are shown in purple. (e1) and (e2) show that our method, although trained with only two subjects, can compose more subjects at inference. (d), (f), and (g) are the results under different controls. In (h) and (i), although the subjects are not aligned with the mask or depth, our method can transfer the texture of the subjects.
  • Figure 2: Our method can flexibly compose different conditions to control multi-subject video customization.
  • Figure 3: Our data construction pipeline VideoCus-Factory uses Kosmos-2 to caption the raw video and detect the subjects. Then we use SAM-2 to segment and filter the detected subjects to derive the training input images. VideoCus-Factory also constructs control data pairs such as mask-to-video and depth-to-video.
  • Figure 4: OmniVCus is a DiT architecture that can compose different input signals to customize a video. (a) LE enables customization with more subjects at inference by activating more frame embeddings with the training subjects. (b) TAE extracts guidance from control signals by aligning the frame embeddings of condition and noise tokens.
  • Figure 5: Visual comparison of single-subject video customization with state-of-the-art algorithms. Our method can change the pose and viewpoint of the subject while preserving identity details such as the hair, jacket, and sweater.
  • ...and 4 more figures