
Gloria: Consistent Character Video Generation via Content Anchors

Yuhang Yang, Fan Zhang, Huaijin Pi, Shuai Guo, Guowei Xu, Wei Zhai, Yang Cao, Zheng-Jun Zha

Abstract

Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency; however, reference-based video generation inherently faces the challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive video collections. Experiments show that our method generates high-quality character videos exceeding 10 minutes and achieves expressive identity and appearance consistency across views, surpassing existing methods.
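The "RoPE as Weak Condition" mechanism relies on a property of rotary position embeddings: identical token embeddings become distinguishable once they are assigned different positional offsets, while attention scores remain a function of relative position only. The minimal sketch below (an illustration, not the paper's implementation; the function `rope` and all parameter choices are hypothetical) demonstrates both properties on plain NumPy vectors.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Minimal rotary position embedding (hypothetical sketch): rotate each
    2-D feature pair of `x` by an angle proportional to `pos`."""
    half = x.shape[-1] // 2
    # Per-pair rotation frequencies, decaying geometrically as in standard RoPE.
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Apply the 2-D rotation to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Two copies of the same anchor embedding, injected at different offsets
# (e.g., two anchor frames), are no longer identical after RoPE ...
anchor = np.ones(8)
a0 = rope(anchor, pos=0)
a1 = rope(anchor, pos=100)  # a distinct offset assigned to a second anchor

# ... yet the rotation preserves the vector norm, and query-key dot products
# depend only on the relative offset, so the condition stays "weak".
q, k = np.arange(8, dtype=float), np.arange(8, 0, -1, dtype=float)
score_near = rope(q, 5) @ rope(k, 7)
score_far = rope(q, 105) @ rope(k, 107)  # same relative distance of 2
```

Because the dot product depends only on the offset difference, shifting all anchors together leaves attention unchanged; only the per-anchor offsets carry the distinguishing signal.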

Paper Structure

This paper contains 21 sections, 2 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Quantitative comparison of long-term consistency.
  • Figure 2: The pipeline to construct training clips and character-centric anchor frames, e.g., global, viewpoint, and expression. The blue arrow marks the subject’s forward orientation, whereas the green arrow marks the camera-facing direction.
  • Figure 3: Overview of the Gloria pipeline, which includes the source of content anchors (Superset Anchors), the manner of injecting these anchors (RoPE as Weak Condition), and the overall framework with multi-modal conditions, e.g., text and audio.
  • Figure 3: Quantitative comparison of fundamental capability.
  • Figure 4: The user study results of expressive ID and multi-view appearance consistency.
  • ...and 9 more figures