StoryNavi: On-Demand Narrative-Driven Reconstruction of Video Play With Generative AI

Alston Lantian Xu, Tianwei Ma, Tianmeng Liu, Can Liu, Alvaro Cassinelli

TL;DR

StoryNavi addresses the challenge of efficiently retrieving information from long videos by enabling non-linear, narrative-driven reconstruction of content through segment retrieval powered by a vision-language model. It constructs a cohesive narrative from user queries and offers two playback modes to balance fidelity and narrative flow, incorporating transcript data and synthesized narration when needed. Technical evaluation shows robust retrieval performance with recall around 0.886 and precision around 0.682, while user studies reveal improved understanding and engagement for complex content when narrative coherence is preserved. The work demonstrates the practical value of narrative-preserving, AI-assisted video navigation and outlines concrete directions to improve granularity, voice synthesis, and adaptability across video types.

Abstract

Manually navigating lengthy videos to seek information or answer questions can be a tedious and time-consuming task for users. We introduce StoryNavi, a novel system powered by VLLMs for generating customised video play experiences by retrieving materials from original videos. It directly answers users' queries by constructing a non-linear sequence of identified relevant clips that forms a cohesive narrative. StoryNavi offers two modes of playback for the constructed video plays: 1) video-centric, which plays the original audio and skips irrelevant segments, and 2) narrative-centric, in which narration guides the experience and the original audio is muted. Our technical evaluation showed adequate retrieval performance compared to human retrieval. Our user evaluation shows that maintaining narrative coherence significantly enhances user engagement when viewing disjointed video segments. However, factors like video genre, content, and the query itself may lead to varying user preferences for the playback mode.

Paper Structure

This paper contains 51 sections, 2 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Prototype user interface for query-based video playback. The interface includes multiple panels: (A) Video Panel displaying pre-processed videos with a timeline indicating relevant (blue) and irrelevant (grey) segments, (B) Query Panel where users can input queries to retrieve relevant segments, (C) Control Panel offering playback options for segment-specific control, (D) Playback Order Panel listing all relevant segments with customisable playback order, and (E) Summary Panel providing brief descriptions of both relevant and irrelevant segments for quick content overview. Note: the video content has been replaced with an AI-generated image due to copyright concerns.
  • Figure 2: Prototype pipeline. 1) Image and audio extraction from the video. 2) Frame annotation using GPT-4o. 3) Frame retrieval based on user query. 4) Refine segments. 5) Output results.
  • Figure 3: Illustration of the StoryNavi pipeline. 1) Image and audio extraction from the video. 2) Frame annotation using GPT-4o and transcription using Whisper. 3) Frame retrieval based on user query. 4) Refine segments. 5) Narrative generation. 6) Playback mode, either video-centric or narrative-centric.
  • Figure 4: Illustration of segment retrieval and construction of two playback modes.
  • Figure 5: Screenshot of video-centric playback mode, featuring an artificial “title card” between segments that displays LLM-generated transition sentences. Note: the video content has been replaced with an AI-generated image due to copyright concerns.
  • ...and 5 more figures
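
The captions for Figures 2 and 3 list a "refine segments" step after frame retrieval but do not say how it works. The sketch below shows one plausible reading, offered only as an illustration: timestamps of frames judged relevant are grouped into contiguous playback segments whenever neighbouring frames are close enough in time. The function name `merge_frames_into_segments` and the parameters `frame_interval` and `max_gap` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def merge_frames_into_segments(
    relevant_times: List[float],
    frame_interval: float = 1.0,  # assumed sampling interval between frames (seconds)
    max_gap: float = 2.0,         # assumed largest gap bridged inside one segment
) -> List[Tuple[float, float]]:
    """Group timestamps of relevant frames into (start, end) segments.

    Illustrative only: the actual refinement step in StoryNavi may differ.
    """
    runs: List[List[float]] = []
    for t in sorted(relevant_times):
        if runs and t - runs[-1][1] <= max_gap:
            # Close enough to the previous relevant frame: extend that run.
            runs[-1][1] = t
        else:
            # Too far away (or first frame): start a new run.
            runs.append([t, t])
    # Each segment ends one sampling interval after its last relevant frame.
    return [(start, last + frame_interval) for start, last in runs]


if __name__ == "__main__":
    for start, end in merge_frames_into_segments([0, 1, 2, 10, 11, 30]):
        print(f"play {start:.1f}s to {end:.1f}s")
```

Under these assumptions, a video-centric playback would play the resulting segments in order with the original audio, skipping everything between them.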