2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency
Xingxi Yin, Yicheng Li, Gong Yan, Chenglin Li, Jian Zhao, Cong Huang, Yue Deng, Yin Zhang
TL;DR
This work addresses the lack of robust identity coherence and fine-grained control in visual storytelling by introducing 2K-Characters-10K-Stories, a large-scale, multi-modal stylized narrative dataset built with an advanced Human-in-the-Loop (HiL) pipeline. It provides granular, decoupled control signals for identity and transient attributes, enforced through a Quality-Gated loop that combines MMLM auto-evaluation, Auto-Prompt Tuning, and Local Image Editing to achieve pixel-level sequence consistency. The authors validate that training on IQCC-verified data yields substantial gains for open-source models like OmniGen2, surpassing several baselines and approaching closed-source performance in narrative quality and control fidelity. Overall, the dataset and HiL methodology establish a new quality standard for controllable visual narratives and pave the way for scalable, fine-grained, multi-character storytelling research and applications.
Abstract
Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce \textbf{2K-Characters-10K-Stories}, a multi-modal stylized narrative dataset of \textbf{2{,}000} uniquely stylized characters appearing across \textbf{10{,}000} illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a \textbf{Human-in-the-Loop pipeline (HiL)} that leverages expert-verified character templates and LLM-guided narrative planning to generate highly aligned structured data. A \textbf{decoupled control} scheme separates persistent identity from transient attributes -- pose and expression -- while a \textbf{Quality-Gated loop} integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.
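
To make the Quality-Gated loop concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the three components named above can be treated as plain callables: an evaluator that scores a frame, a prompt tuner, and a local editor. Every name here (`quality_gated_generate`, `GateResult`, `threshold`, `margin`, `max_rounds`) is hypothetical, and the routing rule (local edit for near misses, prompt tuning plus regeneration otherwise) is an assumption made only for illustration; the abstract does not specify how the loop chooses between repairs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    image: str      # stand-in for an image handle (hypothetical)
    prompt: str     # final prompt after any tuning
    score: float    # last evaluator score
    passed: bool    # whether the frame cleared the gate
    attempts: int   # number of evaluation rounds used


def quality_gated_generate(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    tune_prompt: Callable[[str, float], str],
    edit_locally: Callable[[str, str], str],
    threshold: float = 0.8,   # assumed acceptance score
    margin: float = 0.15,     # assumed "near miss" band below the threshold
    max_rounds: int = 3,      # assumed retry budget (must be >= 1)
) -> GateResult:
    """Generate a frame, then repair it until the evaluator accepts it."""
    image = generate(prompt)
    score = 0.0
    for attempt in range(1, max_rounds + 1):
        score = evaluate(image, prompt)
        if score >= threshold:
            return GateResult(image, prompt, score, True, attempt)
        if attempt < max_rounds:
            if score >= threshold - margin:
                # Near miss: patch the existing image instead of regenerating.
                image = edit_locally(image, prompt)
            else:
                # Far miss: rewrite the prompt and regenerate from scratch.
                prompt = tune_prompt(prompt, score)
                image = generate(prompt)
    return GateResult(image, prompt, score, False, max_rounds)


# Toy stand-ins so the sketch runs without any model; all are hypothetical.
def toy_generate(prompt: str) -> str:
    return f"img[{prompt}]"


def toy_evaluate(image: str, prompt: str) -> float:
    return min(1.0, 0.5 + 0.2 * image.count("+detail") + 0.15 * image.count("|fix"))


def toy_tune(prompt: str, score: float) -> str:
    return prompt + " +detail"


def toy_edit(image: str, prompt: str) -> str:
    return image + "|fix"


result = quality_gated_generate(
    "hero waves", toy_generate, toy_evaluate, toy_tune, toy_edit
)
print(result.passed, result.attempts)
```

With these toy callables, the first round falls well below the gate and triggers prompt tuning, the second round is a near miss and triggers a local edit, and the third round passes. In a real pipeline the evaluator would be a multimodal model and unresolved failures would presumably be routed to human reviewers, which this sketch does not model.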
