Dynamic Concepts Personalization from Single Videos

Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman

TL;DR

This work tackles the challenge of personalizing text-to-video models to dynamic concepts defined by appearance and motion. It introduces Set-and-Sequence, a two-stage LoRA-based framework that first learns an identity basis from unordered frames and then augments it with a motion residual learned from full video sequences, all embedded in a unified spatio-temporal weight space within a Diffusion Transformer. Regularization techniques, including Prior Preservation, High-Dropout for high-rank LoRA, and Context-Aware regularization, stabilize training and enable robust editing and composition. The approach demonstrates superior editing fidelity, compositionality, and motion coherence on human-centric videos, outperforming several baselines and enabling intuitive prompt-driven reconfiguration of dynamic concepts with practical impact for personalized, controllable video generation.

Abstract

Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.
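
The two-stage construction described in the abstract can be illustrated with a small NumPy sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the matrix sizes are hypothetical, random values stand in for trained factors, no training loop is shown, and treating the left low-rank factor as the "basis" and the right factor as the "coefficients" is an assumption about the factorization. All names (`basis`, `coeff_identity`, `coeff_motion`, `effective_weight`) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; real DiT layers are far larger.
d_out, d_in, rank = 8, 6, 2

# Stand-in for a frozen pretrained weight matrix of one layer.
W0 = rng.normal(size=(d_out, d_in))

# Stage 1 (set): identity LoRA learned from an unordered set of frames.
# Random values stand in for the trained factors.
basis = rng.normal(size=(d_out, rank))          # assumed: left factor = identity basis
coeff_identity = rng.normal(size=(rank, d_in))  # assumed: right factor = coefficients

# Stage 2 (sequence): the basis stays frozen; only a residual on the
# coefficients would be trained on the full, ordered video.
coeff_motion = 0.1 * rng.normal(size=(rank, d_in))


def effective_weight(W0, basis, coeff_identity, coeff_motion=None):
    """Return W0 plus the low-rank update built on a shared basis."""
    coeff = coeff_identity
    if coeff_motion is not None:
        # Motion residual augments the coefficients, not the basis.
        coeff = coeff + coeff_motion
    return W0 + basis @ coeff


W_stage1 = effective_weight(W0, basis, coeff_identity)                # appearance only
W_stage2 = effective_weight(W0, basis, coeff_identity, coeff_motion)  # appearance + motion
```

Under these assumptions, the stage-2 weights differ from the stage-1 weights only by `basis @ coeff_motion`, so the motion residual lives in the same low-rank subspace as the identity update. That shared subspace is one way to read the "spatio-temporal weight space" the abstract describes.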

Paper Structure

This paper contains 29 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The Set-and-Sequence framework operates in two stages. (i) Identity Basis: we train the LoRA Set Encoding on an unordered set of frames extracted from the video, focusing only on the appearance of the dynamic concept to achieve high fidelity without temporal distractions. (ii) Motion Residuals: the basis of the identity LoRAs is frozen, and the coefficient part is augmented with coefficients of the LoRA Sequence Encoding, trained on the temporal sequence of the full video clip, allowing the model to capture the motion dynamics of the concept.
  • Figure 2: Local and Global Editing. Our Set-and-Sequence framework enables text-driven edits of dynamic concepts while preserving both their appearance and motion. Edits can be global (e.g., background and lighting) or local (e.g., clothing and object replacement), ensuring high fidelity to the original dynamic concepts.
  • Figure 3: Stylization. Top: Stylization of dynamic concepts achieved by reweighting the identity basis. Bottom: Stylization and motion editing performed using a prompt derived from the video in the top row.
  • Figure 4: Dynamic Concepts Composition. Composition results achieved by our framework, showcasing seamless integration of dynamic concepts, with each concept color-coded for clarity. For a more comprehensive demonstration, refer to the supplementary videos.
  • Figure 5: Comparison with baselines. Comparison of our method with baseline approaches (NewMove [materzynska2024newmove], DreamVideo [wei2023dreamvideo], DB-LoRA [simoruiz2023dreambooth], and DreamMix [molad2023dreamixvideodiffusionmodels]) on two editing scenarios: changing the background and shirt, and adding a glass. Our method demonstrates superior adherence to the prompt while preserving the subject identity, outperforming the baselines.
  • ...and 2 more figures