Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

Zilyu Ye; Jinxiu Liu; Ruotian Peng; Jinjin Cao; Zhiyang Chen; Yiyang Zhang; Ziwei Xuan; Mingyuan Zhou; Xiaoqian Shen; Mohamed Elhoseiny; Qi Liu; Guo-Jun Qi

TL;DR

Openstory++ tackles the difficulty of maintaining instance-level coherence in open-domain visual storytelling by introducing a large-scale, instance-annotated dataset and a narrative-focused training pipeline. It pairs this dataset with Cohere-Bench, a benchmark framework that evaluates long-context image-text generation, including multi-turn storytelling and cross-frame instance consistency. Empirical results show improvements in semantic alignment, style consistency, and instance continuity when training on Openstory++, along with comprehensive human and automated evaluations. Collectively, the work provides both a resource and an evaluation paradigm to drive progress in open-domain, instance-aware visual storytelling and long-context multimodal generation.

Abstract

Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that the models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multi-modal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/

Paper Structure (39 sections, 1 equation, 9 figures, 5 tables)

Introduction
Related Work
Datasets for Story Visualization
Benchmarks for Generative Multi-modal Model
Openstory++
Data Sources
Pipeline Overview
Keyframe Extraction and Deduplication
Single-Image Captioning and Instance-Masking Workflow
Frame-Caption Alignment for Narrative Coherence
Instance Masking
Training Data Challenge
Model Settings
Dataset Comparison
Cohere-Bench
...and 24 more sections

Figures (9)

Figure 1: Visualization of our dataset. On the left is a data sample in which each entity word in the sentence has a corresponding visual annotation, with each color denoting a different instance. On the right is the general pipeline of our dataset annotation process.
Figure 2: This figure presents a prompt designed to enhance narrative flow and coherence across scenes, which contains refined captioning guidelines aimed at enriching imagery with descriptive details while preserving the core content. Additionally, the prompt emphasizes maintaining consistent instances throughout the storytelling process.
Figure 3: This figure shows the video frame sequence generated by our pipeline, with the subject's mask and the subject's bounding box.
Figure 4: This figure showcases the workflow of our pipeline. After obtaining a sequence of frames devoid of redundancy, we first utilized BLIP2 to generate basic image captions. Subsequently, Video-LLaVA was employed to produce a sequence of captions that encapsulate the narrative flow. Guided by the sequence caption, an LLM was prompted to align the entities in the image captions, thus enhancing the narrative coherence across consecutive frames. Next, YOLO-World was applied to detect bounding boxes for the entities. To ensure that labels for the same entities across frames are unique and consistent, we blended the bounding box labels with the assistance of DINO and a facial feature module. Finally, we employed EfficientViT-SAM to obtain the masks for the entities, thereby capturing the spatial extent and characteristics of each entity within the frames.
Figure 5: Overview of the interleaved image-text generation: Both the image and text are produced by the MLLM. During image generation, we take the diffusion model as the visual detokenizer.
...and 4 more figures
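
The label-blending step mentioned in the Figure 4 caption (keeping labels for the same entity unique and consistent across frames) can be illustrated with a minimal sketch. It assumes each detected box already comes with a feature vector; in the pipeline these would come from DINO and a facial feature module, whereas here they are plain lists of floats. The function name assign_instance_ids, the greedy cosine-similarity matching, and the 0.8 threshold are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_instance_ids(
    frames: List[List[List[float]]], threshold: float = 0.8
) -> List[List[int]]:
    """Assign each detection an ID shared with its best match in earlier frames.

    frames[t][k] is the (hypothetical) feature vector of detection k in frame t.
    A detection reuses the ID of the most similar known instance when the
    similarity reaches `threshold` and that ID is still free in the current
    frame; otherwise it receives a new ID.
    """
    prototypes: List[List[float]] = []  # feature of each instance when first seen
    result: List[List[int]] = []
    for detections in frames:
        used = set()  # IDs already taken in this frame, so labels stay unique
        ids = []
        for feat in detections:
            best_id, best_sim = -1, threshold
            for inst_id, proto in enumerate(prototypes):
                sim = cosine(feat, proto)
                if inst_id not in used and sim >= best_sim:
                    best_id, best_sim = inst_id, sim
            if best_id == -1:  # no sufficiently similar instance: start a new one
                best_id = len(prototypes)
                prototypes.append(feat)
            used.add(best_id)
            ids.append(best_id)
        result.append(ids)
    return result


# Toy example: two entities swap detection order in frame 2, a third appears in frame 3.
toy_frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.9, 0.1]],
    [[-1.0, 0.0]],
]
toy_ids = assign_instance_ids(toy_frames)
```

Greedy matching is chosen here only for brevity; a one-to-one assignment solver or per-frame prototype updates would be natural refinements under the same assumptions.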
