JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation
Fangda Chen, Shanshan Zhao, Chuanfu Xu, Long Lan
TL;DR
JointTuner tackles appearance-motion combined video customization by addressing the concept interference and appearance contamination seen in stage-wise methods. It introduces Gated Low-Rank Adaptation (GLoRA) to dynamically fuse appearance and motion experts, and Appearance-independent Temporal Loss (AiT Loss) to bias learning toward motion patterns. The framework is architecture-agnostic and is evaluated on a benchmark of 90 subject-motion combinations with 10 metrics spanning semantic alignment, motion dynamism, temporal consistency, and perceptual quality. Empirical results show that JointTuner achieves balanced, high-quality synthesis on both UNet and Diffusion Transformer backbones and outperforms prior methods, demonstrating robust joint optimization. The work also establishes a standardized evaluation protocol for appearance-motion customization and points toward future directions, including 3D-aware representations.
Abstract
Recent advances in customized video generation have significantly improved the simultaneous adaptation of appearance and motion. Prior methods typically decouple appearance and motion training, which often introduces concept interference and leads to inaccurate rendering of appearance features or motion patterns. These methods also often suffer from appearance contamination, in which background and foreground elements from the reference videos distort the customized video. This paper proposes JointTuner to alleviate these issues. Its core motivation is to enable joint optimization of the appearance and motion components, on which two key innovations build: Gated Low-Rank Adaptation (GLoRA) and Appearance-independent Temporal Loss (AiT Loss). Specifically, GLoRA uses a context-aware activation layer, analogous to a gating regulator, to dynamically steer LoRA modules toward learning either appearance or motion while maintaining spatio-temporal consistency. AiT Loss builds on the finding that channel-temporal shift noise suppresses appearance-related low frequencies while enhancing motion-related high frequencies: it adds the same shift to the diffusion model's predicted noise during fine-tuning, forcing the model to prioritize learning motion patterns. JointTuner's architecture-agnostic design supports both UNet (e.g., ZeroScope) and Diffusion Transformer (e.g., CogVideoX) backbones, so its customization capabilities can scale as foundational video models evolve. Furthermore, we present a systematic evaluation framework for appearance-motion combined customization, covering 90 combinations evaluated along four critical dimensions: semantic alignment, motion dynamism, temporal consistency, and perceptual quality. Our project homepage is available online.
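
To make the gating idea behind GLoRA concrete, the following is a minimal NumPy sketch of a low-rank adapter whose update is scaled by an input-dependent gate. It is an illustration under stated assumptions, not the paper's definition: the class name `GatedLoRALinear`, the sigmoid gate computed from a learned projection of the input, and all initial values are hypothetical choices made for this example.

```python
import numpy as np


class GatedLoRALinear:
    """Frozen linear map plus a gated low-rank update (illustrative only).

    Assumption: the gate is a per-token sigmoid of a learned projection of
    the input. The abstract only says the gate is "context-aware"; this
    exact form is hypothetical.
    """

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (random stand-in here).
        self.W0 = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
        # Standard LoRA factors: small random down-projection,
        # zero-initialized up-projection so the adapter starts as a no-op.
        self.A = rng.normal(size=(rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        # Gate parameters (assumed form); zeros give a gate value of 0.5.
        self.w_gate = np.zeros(d_in)
        self.b_gate = 0.0

    def gate(self, x):
        # One scalar in (0, 1) per token, computed from the token itself.
        z = x @ self.w_gate + self.b_gate
        return 1.0 / (1.0 + np.exp(-z))

    def __call__(self, x):
        base = x @ self.W0.T                  # frozen path
        delta = (x @ self.A.T) @ self.B.T     # low-rank update
        g = self.gate(x)[..., None]           # broadcast over features
        return base + g * delta


# Usage: a batch of 2 clips, 5 tokens each, 8 features per token.
layer = GatedLoRALinear(d_in=8, d_out=6, rank=2)
x = np.random.default_rng(1).normal(size=(2, 5, 8))
y = layer(x)  # shape (2, 5, 6)
```

In this sketch a gate near 0 leaves the frozen backbone output untouched, while a gate near 1 applies the full low-rank update; one such gated adapter per expert would let training route appearance and motion signals to different modules.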
