Movie2Story: A framework for understanding videos and telling stories in the form of novel text
Kangning Li, Zheyang Jia, Anyu Ying
TL;DR
MSBench targets a gap in evaluating multi-modal understanding: text generation over long videos with rich auxiliary information. It introduces automated data generation and five narrative-centered tasks. The core method couples a Foundation Models Pool with a structured M2S-LLM pipeline that extracts video and audio features, aligns them temporally, and generates long-form, novel-style stories with improved coherence and knowledge integration. The paper introduces reference-free metrics (Language Fluency, InfoSim, InfoDiverse) and prompt-based qualitative assessments (GPT Assistant Metrics, SACOR) to judge narrative quality beyond surface language. Experiments show that existing MLLMs lag in long-context, multi-modal storytelling, while M2S-LLM gains roughly 15% on key metrics over baselines. Overall, MSBench provides a practical, scalable framework for evaluating and advancing narrative-focused multi-modal understanding.
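To make the temporal alignment step concrete, the following is a minimal sketch, not the paper's implementation. It assumes that video captions and audio transcripts arrive as timestamped text segments and pairs them by interval overlap. The `Segment` type, the `overlap` and `align` helpers, and the sample data are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Hypothetical timestamped text segment (a caption or a transcript line)."""
    start: float  # seconds
    end: float    # seconds
    text: str


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align(video: List[Segment], audio: List[Segment]) -> List[Tuple[Segment, List[Segment]]]:
    """Pair each video segment with every audio segment that overlaps it in time."""
    return [(v, [a for a in audio if overlap(v, a) > 0.0]) for v in video]


# Illustrative data only; not taken from the paper's dataset.
video = [
    Segment(0.0, 5.0, "a man enters a room"),
    Segment(5.0, 10.0, "he sits at a desk"),
]
audio = [
    Segment(1.0, 3.0, "Hello?"),
    Segment(4.5, 6.0, "Anyone here?"),
    Segment(12.0, 13.0, "Fine."),
]

aligned = align(video, audio)
for v, matches in aligned:
    spoken = " / ".join(a.text for a in matches)
    print(f"[{v.start:.1f}-{v.end:.1f}] {v.text} | {spoken}")
```

An audio segment that straddles a boundary (here "Anyone here?") is attached to both neighboring video segments, while one that overlaps no video segment is dropped. A real pipeline might instead assign each audio segment to the single video segment with the largest overlap; that choice is left open here.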
Abstract
In recent years, large-scale models have advanced significantly, accompanied by numerous high-quality benchmarks for evaluating various aspects of their comprehension abilities. However, most existing benchmarks focus on spatial understanding in static image tasks. While some benchmarks extend evaluation to temporal tasks, they fall short in assessing text generation under complex contexts involving long videos and rich auxiliary information. To address this limitation, we propose a novel benchmark, the Multi-modal Story Generation Benchmark (MSBench), designed to evaluate text generation in scenarios enriched with auxiliary information. Our work introduces an automatic dataset generation method to ensure the availability of accurate auxiliary information. First, we leverage existing datasets and apply automated processes to generate new evaluation datasets, significantly reducing manual effort. Second, we refine auxiliary data through systematic filtering and use state-of-the-art models to ensure the fairness and accuracy of the ground-truth datasets. Our experiments reveal that current Multi-modal Large Language Models (MLLMs) perform suboptimally under the proposed evaluation metrics, highlighting significant gaps in their capabilities. To address these gaps, we propose a novel model architecture and methodology to better handle the overall process, demonstrating improvements on our benchmark.
