ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation

Ziyang Mai, Yu-Wing Tai

TL;DR

ContextAnyone tackles the core challenge of preserving character identity and contextual appearance across T2V sequences. It integrates a dual-encoder framework with a DiT-based diffusion backbone and introduces Emphasize-Attention to reinforce reference cues, along with Gap-RoPE to stabilize temporal modeling. By jointly reconstructing the reference image and generating video frames, and by using a purpose-built dataset pipeline to avoid trivial pixel copying, the method achieves superior identity fidelity and visual quality across diverse motions. The approach offers practical improvements for narrative consistency in character-driven video generation and sets the stage for multi-reference and multi-character extensions.

Abstract

Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: https://github.com/ziyang1106/ContextAnyone.
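
The abstract states that the dual-guidance loss combines a diffusion objective with a reference reconstruction objective but does not give its form here. The following is a minimal sketch under stated assumptions: both terms are taken to be mean squared errors, the reference tokens are assumed to come first in the sequence, and the function name `dual_guidance_loss` and the weight `ref_weight` are hypothetical, not taken from the paper.

```python
import numpy as np

def dual_guidance_loss(pred, target, num_ref_tokens, ref_weight=1.0):
    """Illustrative two-term loss (assumed form, not the paper's definition).

    pred, target: arrays of shape (tokens, channels). The first
    num_ref_tokens rows are assumed to belong to the reference image,
    the remaining rows to the generated video frames.
    """
    ref_err = pred[:num_ref_tokens] - target[:num_ref_tokens]
    vid_err = pred[num_ref_tokens:] - target[num_ref_tokens:]
    diffusion_term = float(np.mean(vid_err ** 2))   # video denoising error
    reference_term = float(np.mean(ref_err ** 2))   # reference reconstruction error
    # ref_weight is a hypothetical balancing coefficient.
    return diffusion_term + ref_weight * reference_term
```

Setting `ref_weight` to zero recovers a plain diffusion objective on the video tokens, which makes it easy to see what the reconstruction term adds.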

Paper Structure

This paper contains 14 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of ContextAnyone. Given a reference image and a text prompt, our model generates character-consistent videos that preserve visual details across diverse scenes, while prior methods struggle to retain all elements from the reference. Pink boxes highlight key details such as the chef hat, collar shape, and pants. Green boxes indicate regions where these details are faithfully preserved, whereas red boxes mark inconsistencies, such as the collar mismatch in the lower left. Many other inconsistencies are omitted for simplicity.
  • Figure 2: Framework of ContextAnyone. (a) Input: The model takes a text prompt and a reference image $I_r$. The prompt is augmented by a VLM and encoded by an LLM encoder, while the image is processed by two encoders: one for cross-attention guidance and one for latent concatenation. (b) Backbone: The DiT backbone contains stacked blocks with three attention modules. The input latent merges reference latents (green) and noisy video latents (blue), which are decoded separately by the VAE after the DiT blocks. (c) Emphasize Attention: Latents are split into reference and video parts; video latents serve as queries and reference latents as keys and values to reinforce identity. (d) Self-Attention: The attention map is masked so that reference latents do not query video latents, enforcing a one-way information flow from reference to video tokens.
  • Figure 3: Left: Architecture of self-attention with Gap-RoPE applied to Q, K, and V. Upper Right: Standard RoPE assigns continuous temporal indices across all tokens. Lower Right: Gap-RoPE introduces a shift $\beta$ from the generated video frame onward, creating a positional gap between reference and video tokens.
  • Figure 4: Dataset pipeline. Given a ground-truth video, we extract its first frame (green box) and randomly sample an action prompt and an environment prompt from pools. These prompts and the first frame are fed into an image editing model to modify the person’s action and the scene’s illumination. A vision-language model then assesses and filters out invalid edits, after which a segmentation model isolates the person foreground to obtain the final reference image.
  • Figure 5: Qualitative evaluation. Each group shows, from top to bottom, the reference image (Ref.), the results of our method (Ours), Phantom, and VACE. As illustrated, our method produces the most realistic and consistent results in terms of facial identity and overall appearance. In contrast, Phantom and VACE exhibit noticeable artifacts and inconsistencies in facial regions or outfit alignment (highlighted by red boxes). Our approach achieves superior appearance consistency and motion fidelity compared to existing methods.
  • ...and 2 more figures
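
The mechanisms described in the captions for Figures 2 and 3 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, reference tokens are assumed to precede video tokens in the sequence, and the Emphasize-Attention sketch omits learned projections, multiple heads, and any residual connection. It also assumes that reference tokens pass through that module unchanged, which the captions do not state.

```python
import numpy as np

def gap_rope_positions(num_ref_frames, num_video_frames, beta):
    """Temporal indices for RoPE. Reference frames keep 0..R-1; video
    frames are shifted by beta. beta=0 gives standard continuous indices."""
    ref_pos = np.arange(num_ref_frames)
    vid_pos = np.arange(num_video_frames) + num_ref_frames + beta
    return np.concatenate([ref_pos, vid_pos])

def one_way_mask(num_ref_tokens, num_video_tokens):
    """mask[q, k] is True where query q may attend to key k.
    Reference queries cannot see video keys; video queries see all keys."""
    n = num_ref_tokens + num_video_tokens
    mask = np.ones((n, n), dtype=bool)
    mask[:num_ref_tokens, num_ref_tokens:] = False
    return mask

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def emphasize_attention(latents, num_ref_tokens):
    """Video latents act as queries, reference latents as keys and values.
    Reference latents are returned unchanged (an assumption of this sketch)."""
    ref, vid = latents[:num_ref_tokens], latents[num_ref_tokens:]
    scores = vid @ ref.T / np.sqrt(latents.shape[-1])
    vid_out = softmax(scores) @ ref
    return np.concatenate([ref, vid_out], axis=0)
```

With one reference frame, three video frames, and `beta=4`, `gap_rope_positions` yields indices 0, 5, 6, 7, so the reference sits apart from the video frames on the temporal axis instead of reading as the frame immediately before them.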