Improving Visual Storytelling with Multimodal Large Language Models

Xiaochuan Lin, Xiangyong Chen

TL;DR

The paper tackles visual storytelling, the task of aligning image sequences with coherent narratives, through a framework that combines large language models (LLMs) and large vision-language models (LVLMs) with instruction tuning. It introduces a diverse multimodal dataset and a learning strategy that combines supervised fine-tuning with reinforcement learning guided by GPT-4-based rewards, targeting narrative coherence, relevance, and emotional depth. GPT-4-based quantitative evaluation and human assessment both score the approach above existing models, supporting the use of instruction tuning and GPT-4-driven assessment for multimodal storytelling. The approach offers a practical route to more engaging, temporally consistent visual narratives across domains, with potential for broader multimodal extensions and applications.

Abstract

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the complexity of aligning visual and textual information. This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs) combined with instruction tuning to address these challenges. We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements. Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities. Quantitative evaluations using GPT-4 and qualitative human assessments demonstrate that our approach significantly outperforms existing models, achieving higher scores in narrative coherence, relevance, emotional depth, and overall quality. The results underscore the effectiveness of instruction tuning and the potential of LLMs/LVLMs in advancing visual storytelling.

Paper Structure (16 sections, 4 equations, 4 tables)