CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training

Xiuli Bi, Jian Lu, Bo Liu, Xiaodong Cun, Yong Zhang, Weisheng Li, Bin Xiao

TL;DR

This work tackles the problem of jointly customizing appearance and motion in text-to-video diffusion, where naive merging of separate LoRA adapters often yields artifacts. It introduces CustomTTT, which first identifies the most influential layers for appearance and motion and trains LoRAs on those layers separately, then applies a novel test-time training (TTT) step to merge them smoothly by distilling from a single-LoRA teacher using a reference latent and optimizing appearance- and motion-preserving losses. The approach yields superior text-video alignment, visual fidelity, and temporal consistency compared to state-of-the-art methods, while using far fewer trainable parameters. The findings offer practical guidance for multi-concept customization in diffusion-based video generation and point to pathways for improving base models and integration strategies in future work.

Abstract

Benefiting from large-scale pre-training on text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from a text description. Moreover, given a few reference images or videos, the parameter-efficient fine-tuning method LoRA can generate high-quality customized concepts, e.g., a specific subject or the motion from a reference video. However, combining multiple concepts trained from different references into a single network produces obvious artifacts. To this end, we propose CustomTTT, which makes it easy to jointly customize the appearance and the motion of the given video. In detail, we first analyze the influence of the prompt in the current video diffusion model and find that LoRAs are needed only on specific layers for appearance and motion customization. Furthermore, since each LoRA is trained individually, we propose a novel test-time training technique that updates the parameters after combination, utilizing the individually trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed method. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.
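
To make the combination step concrete, the following is a minimal sketch of merging two individually trained LoRAs that are attached only to selected layers, i.e. the naive merge that the test-time training step is meant to refine. It is not the authors' implementation. The layer indices (2 and 6 for appearance, 2 and 4 for motion) are borrowed from the prompt-injection figure captions, and assuming the LoRAs sit on exactly those layers is an assumption here. The function names, the tiny 2x2 weights, and the rank-1 factors are all hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
LoRA = Tuple[Matrix, Matrix]  # (B, A): B is d_out x r, A is r x d_in


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, kept dependency-free for illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w: Matrix, delta: Matrix, scale: float) -> Matrix:
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


def merge_loras(
    base: Dict[int, Matrix],
    appearance: Dict[int, LoRA],
    motion: Dict[int, LoRA],
    scale: float = 1.0,
) -> Dict[int, Matrix]:
    """Naively add each LoRA update B @ A to the layers it was trained on."""
    merged: Dict[int, Matrix] = {}
    for layer, w in base.items():
        out = [row[:] for row in w]  # copy so the base weights stay untouched
        for adapters in (appearance, motion):
            if layer in adapters:
                b, a = adapters[layer]
                out = add_scaled(out, matmul(b, a), scale)
        merged[layer] = out
    return merged


# Hypothetical toy setup: six layers with 2x2 zero weights.
base = {i: [[0.0, 0.0], [0.0, 0.0]] for i in range(1, 7)}
app_lora: LoRA = ([[1.0], [0.0]], [[0.5, 0.0]])   # rank-1 appearance update
mot_lora: LoRA = ([[0.0], [1.0]], [[0.0, 0.25]])  # rank-1 motion update
appearance = {2: app_lora, 6: app_lora}  # assumed appearance layers
motion = {2: mot_lora, 4: mot_lora}      # assumed motion layers

merged = merge_loras(base, appearance, motion)
```

In this toy setup, layer 2 receives both updates, layers 4 and 6 receive one each, and the remaining layers keep their base weights. Because the two adapters were trained separately, nothing in this sum guarantees they cooperate on a shared layer, which is the gap the abstract's test-time training technique addresses.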

Paper Structure

This paper contains 27 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Given a single video for motion reference and a few images for appearance reference, our method can generate customized videos with multiple customized concepts in terms of the combinations of appearance and motion.
  • Figure 2: The overall pipeline. We first train the LoRAs on the specific layers for appearance (a) and motion (b) customization individually. Then, we design a test-time training method to further improve the results when combining.
  • Figure 3: Examining the influence of the $i$-th layer on appearance and motion in video generation. The text prompt $p^\ast$ is injected into the $i$-th layer, while the text prompt $p$ is injected into all other layers.
  • Figure 4: The effect of prompt injection on appearance. Injecting $p^*$ into both layers $i=2,6$ gives results comparable to injecting $p^*$ into all layers.
  • Figure 5: The effect of prompt injection on motion. Injecting $p^{*}$ into both layers $i=2,4$ can remove the influence of prompt $p$.
  • ...and 4 more figures
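
The prompt-injection probe described in the captions of Figures 3 to 5 can be sketched as a per-layer prompt assignment: the probe prompt $p^\ast$ goes to the chosen layers and the base prompt $p$ goes to every other layer. This is only an illustration of that bookkeeping, not the authors' code. The function name `per_layer_prompts`, the 1-based layer indexing, the total of 8 layers, and the example prompt strings are assumptions.

```python
from typing import Iterable, List


def per_layer_prompts(
    num_layers: int,
    base_prompt: str,
    probe_prompt: str,
    inject_at: Iterable[int],
) -> List[str]:
    """Return the prompt each layer conditions on (layers are 1-indexed)."""
    inject = set(inject_at)
    out_of_range = sorted(i for i in inject if not 1 <= i <= num_layers)
    if out_of_range:
        raise ValueError(f"layer indices out of range: {out_of_range}")
    return [
        probe_prompt if i in inject else base_prompt
        for i in range(1, num_layers + 1)
    ]


# Hypothetical example: 8 layers, probing layers 2 and 6 as in Figure 4.
prompts = per_layer_prompts(
    num_layers=8,
    base_prompt="a dog running",       # plays the role of p
    probe_prompt="a red car running",  # plays the role of p*
    inject_at=[2, 6],
)
```

Comparing the video generated with this mixed assignment against one generated with $p^\ast$ on all layers is the kind of check the Figure 4 caption describes for locating the layers that matter for appearance.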