Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang

TL;DR

Proteus-ID is a diffusion-based framework for identity-consistent and motion-coherent video customization. Its Adaptive Motion Learning (AML) component is a motion-aware optimization strategy that reweights the training loss using optical-flow-derived motion heatmaps, improving motion realism without requiring additional inputs.

Abstract

Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Code and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.
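
To make the AML idea in the abstract concrete, the following is a minimal sketch of a motion-weighted loss. It is not the authors' implementation. The function names `motion_heatmap` and `motion_weighted_mse`, the weighting form `1 + alpha * heat`, the `alpha` parameter, and the normalization by total weight are all assumptions chosen for illustration. The optical flow is assumed to be precomputed and given as a small grid of `(dx, dy)` displacements, and plain Python lists stand in for tensors.

```python
import math


def motion_heatmap(flow):
    """Turn an H x W grid of (dx, dy) flow vectors into weights in [0, 1].

    Hypothetical helper: the heat value is the flow magnitude divided by the
    largest magnitude in the grid (all zeros if nothing moves).
    """
    mags = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    peak = max(max(row) for row in mags)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in mags]
    return [[m / peak for m in row] for row in mags]


def motion_weighted_mse(pred, target, heatmap, alpha=1.0):
    """Squared error averaged with per-pixel weight (1 + alpha * heat).

    Assumed weighting form: alpha = 0 recovers the ordinary mean squared
    error, while larger alpha puts more emphasis on high-motion pixels.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p_row, t_row, h_row in zip(pred, target, heatmap):
        for p, t, h in zip(p_row, t_row, h_row):
            w = 1.0 + alpha * h
            weighted_sum += w * (p - t) ** 2
            weight_total += w
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Only the top-right pixel moves in this toy 2 x 2 flow field.
    flow = [[(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]]
    heat = motion_heatmap(flow)
    target = [[0.0, 0.0], [0.0, 0.0]]
    error_on_moving = [[0.0, 1.0], [0.0, 0.0]]
    error_on_static = [[1.0, 0.0], [0.0, 0.0]]
    print(motion_weighted_mse(error_on_moving, target, heat))
    print(motion_weighted_mse(error_on_static, target, heat))
```

In this sketch the same unit error costs more when it falls on the moving pixel than on a static one, which is the qualitative behavior the abstract attributes to AML. Because the heatmap is derived from the training video itself, no extra input is needed at inference time.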

Paper Structure

This paper contains 17 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Examples of video identity customization generated by Proteus-ID. Given a reference image and a text description, Proteus-ID can synthesize compelling and expressive animations. Notably, Proteus-ID effectively preserves identity consistency, maintains semantic alignment across diverse styles, and produces natural, temporally coherent motion. Orange highlights attributes mentioned in instructions.
  • Figure 2: Overview of Proteus-ID. Built on a pre-trained DiT, Proteus-ID integrates three key components: Multimodal Identity Fusion (MIF), Time-Aware Identity Injection (TAII), and Adaptive Motion Learning (AML). Given a reference image and user prompt, MIF uses a Q-Former to integrate identity text embeddings with visual features prior to denoising. TAII incorporates timestep embeddings to adaptively modulate identity conditioning during denoising. AML enhances motion realism by introducing a self-supervised motion signal to reweight the training loss—without requiring additional inputs at inference.
  • Figure 3: Qualitative comparison with state-of-the-art methods. ID-Animator exhibits poor identity preservation and visual quality. While ConsisID, Fantasy, and Concat-ID improve identity preservation to some extent, they suffer from severe copy-paste artifacts and text misalignment. EchoVideo maintains consistent identity and text alignment but lacks fluid, natural motion. In contrast, our method achieves strong performance in visual quality, identity preservation, text alignment, and coherent motion, substantially outperforming baseline methods.
  • Figure 4: Effect of Different Components via Qualitative Analysis. (a) Removing Multimodal Identity Fusion (MIF) leads to severe text misalignment. (b) Removing Time-Aware Identity Injection (TAII) affects identity details. (c) Removing Adaptive Motion Learning (AML) results in stiff motion.
  • Figure 5: Video statistics of the dataset. The dataset comprises diverse video durations and caption lengths, with most videos in 1080P.
  • ...and 2 more figures