Table of Contents
Fetching ...

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu

TL;DR

NarraScore addresses the challenge of long-form video soundtrack generation by introducing a hierarchical framework that distills narrative logic into continuous emotion signals. It leverages frozen Vision-Language Models as affective sensors to obtain frame-level Valence-Arousal trajectories and employs a Dual-Branch Injection with a Global Semantic Anchor and a Token-Level Affective Adapter to enforce global stylistic coherence while modulating local tension. Key contributions include a narrative-aware affective reasoning module, a holistic musical conceptualization step, and a lightweight, data-efficient hierarchical acoustic synthesis that preserves the pretrained audio backbone. The approach demonstrates state-of-the-art coherence and narrative alignment with negligible computational overhead, enabling autonomous, scalable soundtrack generation for long-form videos across diverse genres.

Abstract

Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a \textit{Global Semantic Anchor} ensures stylistic stability, while a surgical \textit{Token-Level Affective Adapter} modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.

NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

TL;DR

NarraScore addresses the challenge of long-form video soundtrack generation by introducing a hierarchical framework that distills narrative logic into continuous emotion signals. It leverages frozen Vision-Language Models as affective sensors to obtain frame-level Valence-Arousal trajectories and employs a Dual-Branch Injection with a Global Semantic Anchor and a Token-Level Affective Adapter to enforce global stylistic coherence while modulating local tension. Key contributions include a narrative-aware affective reasoning module, a holistic musical conceptualization step, and a lightweight, data-efficient hierarchical acoustic synthesis that preserves the pretrained audio backbone. The approach demonstrates state-of-the-art coherence and narrative alignment with negligible computational overhead, enabling autonomous, scalable soundtrack generation for long-form videos across diverse genres.

Abstract

Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a \textit{Global Semantic Anchor} ensures stylistic stability, while a surgical \textit{Token-Level Affective Adapter} modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
Paper Structure (35 sections, 7 equations, 4 figures, 4 tables)

This paper contains 35 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Narrative-aware video background music generation. Unlike baselines that rely on surface visuals and fail to capture narrative tension or hidden subtext, our approach leverages global plot $S_{global}$ and local emotion $E_{local}$ to generate soundtracks that are temporally coherent and narratively resonant.
  • Figure 2: Overview of our framework
  • Figure 3: Our Method of Token-Wise Control Injection.
  • Figure 4: Visualization of the generated spectrograms and the corresponding narrative emotion curves.