2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency

Xingxi Yin, Yicheng Li, Gong Yan, Chenglin Li, Jian Zhao, Cong Huang, Yue Deng, Yin Zhang

TL;DR

This work addresses the lack of robust identity coherence and fine-grained control in visual storytelling by introducing 2K-Characters-10K-Stories, a large-scale, multi-modal stylized narrative dataset built with an advanced Human-in-the-Loop (HiL) pipeline. It provides granular, decoupled control signals for identity and transient attributes, enforced through a Quality-Gated loop that combines MMLM auto-evaluation, Auto-Prompt Tuning, and Local Image Editing to achieve pixel-level sequence consistency. The authors show that training on data verified by the Integrated Quality Control and Correction (IQCC) process yields substantial gains for open-source models such as OmniGen2, which surpass several baselines and approach closed-source performance in narrative quality and control fidelity. Overall, the dataset and HiL methodology set a new quality standard for controllable visual narratives and pave the way for scalable, fine-grained, multi-character storytelling research and applications.

Abstract

Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop (HiL) pipeline that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A decoupled control scheme separates persistent identity from transient attributes (pose and expression), while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Our Structured HiL Pipeline synthesizes high-fidelity visual narratives. The left panel illustrates the multi-character identity templates. The middle panel enables precise manipulation of transient attributes for a character identity. The right panels demonstrate robust sequential identity preservation across complex visual narratives, validating the fidelity of our Quality-Gated synthesis process.
  • Figure 2: Three-Phase Human-in-the-Loop (HiL) Pipeline for Quality-Gated, Controllable Narrative Data Synthesis. This HiL pipeline enforces semantic control and quality assurance across three phases. Phase I (Design) establishes visual constraints, including Character ID templates (Human Validation 1). Phase II (Encoding) transforms narrative text into the Structured Frame Metadata (Human Validation 2). The core generation process occurs in Phase III (Synthesis and Resolution). After image generation, the output is immediately routed through the Integrated Quality Control and Correction (IQCC). The IQCC unifies the Triple-Check diagnosis and initiates a targeted automated resolution strategy, either Auto-Prompt Tuning (APT) or Local Image Editing (LIE), based on the scope and type of error.
  • Figure 3: Ablation Study of Data Synthesis Strategies. This comparison demonstrates the necessity of our HiL control mechanism before the IQCC process. Row 1 (Text-Only) shows that raw frame descriptions result in low fidelity and uncontrolled attribute variation. Row 2 (Text Augmented) shows marginal improvements but still lacks the geometric precision required for complex control. Critically, Rows 4 and 5 illustrate the adopted two-step strategy: using a pose/expression reference image as a dedicated control signal is essential for achieving stable, high-fidelity $C_{ID}$ preservation and accurate transient attribute control.
  • Figure 4: Joint Distribution of 2K Characters by Subject-Type and Profession. This concise heatmap visualizes the character pool's diversity across eight distinct subject-types (X-axis) and four distinct profession categories (Y-axis). The values represent the raw count for each intersection, confirming a robust and balanced representation of characters.
  • Figure 5: Statistical Distribution of 10K-Stories Complexity. This figure validates the dataset's scale and structured design. (Left) The Character Number Distribution is intentionally balanced across 1 to 5 characters, ensuring broad social complexity. (Right) The Sequence Length Distribution confirms the dataset's focus on sequential consistency, primarily featuring stories between 6 and 9 frames.
  • ...and 1 more figures
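
The quality-gated loop described in the Figure 2 caption (generate, diagnose, then resolve with APT or LIE) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name below (`Diagnosis`, `quality_gated_synthesis`, the `"global"`/`"local"` scope labels, `max_rounds`) is hypothetical, images are stood in for by plain strings, and two points are assumptions rather than statements from the paper: that global errors are routed to prompt tuning and local errors to local editing, and that a frame is handed to a human reviewer once the retry budget runs out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Diagnosis:
    """Result of an automatic check on one generated frame (hypothetical)."""
    passed: bool
    scope: Optional[str] = None  # assumed labels: "global" or "local"


def quality_gated_synthesis(
    prompt: str,
    generate: Callable[[str], str],
    diagnose: Callable[[str], Diagnosis],
    tune_prompt: Callable[[str, Diagnosis], str],
    edit_locally: Callable[[str, Diagnosis], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], str]:
    """Generate a frame and repair it until it passes the gate.

    Assumed routing: local errors go to local editing (LIE-like),
    anything else goes to prompt tuning plus regeneration (APT-like).
    """
    image = generate(prompt)
    for round_idx in range(max_rounds + 1):
        diagnosis = diagnose(image)
        if diagnosis.passed:
            return image, "accepted"
        if round_idx == max_rounds:
            break  # retry budget exhausted; do not attempt another fix
        if diagnosis.scope == "local":
            image = edit_locally(image, diagnosis)
        else:
            prompt = tune_prompt(prompt, diagnosis)
            image = generate(prompt)
    # Assumption: unresolved frames fall back to human review.
    return None, "needs_human_review"


# Toy stand-ins so the sketch runs end to end; images are plain strings.
def toy_generate(prompt: str) -> str:
    return f"img({prompt})"


def toy_diagnose(image: str) -> Diagnosis:
    if "pose" not in image:
        return Diagnosis(passed=False, scope="global")
    if "fixed" not in image:
        return Diagnosis(passed=False, scope="local")
    return Diagnosis(passed=True)


def toy_tune_prompt(prompt: str, diagnosis: Diagnosis) -> str:
    return prompt + " +pose ref"


def toy_edit_locally(image: str, diagnosis: Diagnosis) -> str:
    return image + "|fixed"


result = quality_gated_synthesis(
    "hero waves", toy_generate, toy_diagnose, toy_tune_prompt, toy_edit_locally
)
print(result)
```

In this toy run the first frame fails globally and is regenerated from a tuned prompt, the second fails locally and is patched in place, and the third check passes. Splitting the two repair paths means a small local defect does not force a full regeneration, which would risk disturbing parts of the frame that were already correct.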