JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

Fangda Chen, Shanshan Zhao, Chuanfu Xu, Long Lan

TL;DR

JointTuner tackles appearance-motion combined video customization by addressing the concept interference and appearance contamination seen in stage-wise methods. It introduces Gated Low-Rank Adaptation (GLoRA) to dynamically fuse appearance and motion experts, and Appearance-independent Temporal Loss (AiT Loss) to bias learning toward motion patterns. The framework is architecture-agnostic and is evaluated on a benchmark of 90 subject-motion combinations with 10 metrics spanning semantic alignment, motion dynamism, temporal consistency, and perceptual quality. Empirical results show that JointTuner achieves balanced, high-quality synthesis on both UNet and Diffusion Transformer backbones and outperforms prior methods. The work also establishes a standardized evaluation protocol for appearance-motion customization and points toward future directions, including 3D-aware representations.

Abstract

Recent advancements in customized video generation have led to significant improvements in the simultaneous adaptation of appearance and motion. Prior methods typically decouple appearance and motion training, which often introduces concept interference and results in inaccurate rendering of appearance features or motion patterns. In addition, these methods often suffer from appearance contamination, in which background and foreground elements from reference videos distort the customized video. This paper aims to alleviate these issues by proposing JointTuner. The core motivation of JointTuner is to enable joint optimization of the appearance and motion components, upon which two key innovations are developed: Gated Low-Rank Adaptation (GLoRA) and Appearance-independent Temporal Loss (AiT Loss). Specifically, GLoRA uses a context-aware activation layer, analogous to a gating regulator, to dynamically steer LoRA modules toward learning either appearance or motion while maintaining spatio-temporal consistency. Moreover, based on the finding that channel-temporal shift noise suppresses appearance-related low-frequency components while enhancing motion-related high-frequency components, we design the AiT Loss. This loss adds the same shift to the diffusion model's predicted noise during fine-tuning, forcing the model to prioritize learning motion patterns. JointTuner's architecture-agnostic design supports both UNet (e.g., ZeroScope) and Diffusion Transformer (e.g., CogVideoX) backbones, so its customization capabilities can scale with the evolution of foundational video models. Furthermore, we present a systematic evaluation framework for appearance-motion combined customization, covering 90 combinations evaluated along four critical dimensions: semantic alignment, motion dynamism, temporal consistency, and perceptual quality. Our project homepage is available online.
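
As a rough illustration of the gating idea behind GLoRA, the NumPy sketch below adds an input-dependent gate to a standard LoRA update on a frozen linear layer. The abstract does not specify the form of the gate, so everything specific here is an assumption made for illustration: the sigmoid gate computed per token, the names `GatedLoRALinear`, `w_gate`, and `b_gate`, and the shapes. It should not be read as the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class GatedLoRALinear:
    """Frozen linear layer plus a gated low-rank update (illustrative only)."""

    def __init__(self, d_in, d_out, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pre-trained weight (a random stand-in in this sketch).
        self.W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
        # Standard LoRA factors: A is small and random, B starts at zero,
        # so the adapted layer initially reproduces the frozen layer.
        self.A = rng.normal(size=(rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        # Hypothetical gate parameters: one scalar gate per input token.
        self.w_gate = np.zeros(d_in)
        self.b_gate = 0.0
        self.scale = alpha / rank

    def gate(self, x):
        # x: (tokens, d_in) -> gate values in (0, 1) with shape (tokens, 1).
        return sigmoid(x @ self.w_gate + self.b_gate)[..., None]

    def __call__(self, x):
        base = x @ self.W.T                              # frozen path
        delta = (x @ self.A.T) @ self.B.T * self.scale   # low-rank update
        return base + self.gate(x) * delta               # gated fusion


layer = GatedLoRALinear(d_in=8, d_out=6, rank=2)
tokens = np.ones((3, 8))
out = layer(tokens)  # shape (3, 6); equals the frozen output while B is zero
```

In a joint-training setup, only `A`, `B`, and the gate parameters would be updated while `W` stays frozen; the gate lets the same adapter contribute more or less strongly depending on the input it sees.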

Paper Structure

This paper contains 31 sections, 11 equations, 17 figures, 9 tables, 2 algorithms.

Figures (17)

  • Figure 1: Results of customized video generation using JointTuner. Given paired appearance and motion inputs, it produces videos that reflect both the desired subject appearance and motion patterns through adaptive joint training.
  • Figure 2: Illustration of a failure case from advanced customized video generation methods. (a) and (b) show videos generated by MotionDirector-ZS and DreamVideo-ZS across three stages: appearance learning (bear plushie), motion learning (twirling), and combined inference. (c) presents results from JointTuner-ZS, which improves inference by jointly learning both appearance and motion. Note that "-ZS" denotes methods based on ZeroScope.
  • Figure 3: Illustration of the impact of noise shift on diffusion inversion in latent space with CogVideoX. Starting from a clean video, random noise $\epsilon \in \mathbb{R}^{F \times C \times H \times W}$ is added during the diffusion forward process, followed by diffusion inversion to recover the original video. Red bounding boxes highlight regions of interest.
  • Figure 4: Frequency distribution of latent and shifted latent signals across shift types and time steps. These results show normalized energy for three shift types: Channel-wise, Spatial-wise, and Temporal-wise at time steps 200, 400, 600, and 800.
  • Figure 5: Architecture of JointTuner, an adaptive joint training framework with two main steps: (1) integrating GLoRA into the transformer blocks for efficient fine-tuning, and (2) optimizing GLoRA with two complementary losses. The original diffusion loss leverages reference images to preserve appearance details, and the AiT Loss utilizes reference videos to focus on motion patterns. The pre-trained text-to-video model remains frozen throughout training; only GLoRA parameters are updated. During inference, trained GLoRA weights are loaded, and customized videos are generated conditioned solely on the input prompt.
  • ...and 12 more figures