Dual-path Collaborative Generation Network for Emotional Video Captioning

Cheng Ye; Weidong Chen; Jingyu Li; Lei Zhang; Zhendong Mao

TL;DR

The paper addresses emotional video captioning by modeling dynamic emotion evolution and balancing emotional guidance with factual content. It introduces a Dual-path Collaborative Generation Network with a Dynamic Emotion Perception Path that evolves emotion cues and an Adaptive Caption Generation Path that gates emotional guidance via an Emotion Adaptive Decoder. Key contributions include the dynamic emotion evolution module (element-level and subspace-level), the emotion adaptive decoder with emotion intensity estimation and dual losses, and extensive experiments on three EmVidCap datasets showing state-of-the-art results, especially when using CLIP features. The approach demonstrates strong expressiveness and generalization, including transfer to emotional image captioning tasks.

Abstract

Emotional Video Captioning (EVC) is an emerging task that aims to describe the factual content of videos together with the intrinsic emotions they express. The essence of the EVC task is to effectively perceive subtle and ambiguous visual emotional cues during caption generation, which traditional video captioning neglects. Existing EVC methods first perceive global visual emotional cues and then combine them with the video features to guide emotional caption generation, which neglects two characteristics of the EVC task. First, these methods neglect the dynamic, subtle changes in the intrinsic emotions of the video, making it difficult to handle common scenes with diverse and changeable emotions. Second, because these methods incorporate emotional cues into every step, the guiding role of emotion is overemphasized, so factual content is more or less ignored during generation. To this end, we propose a dual-path collaborative generation network, which dynamically perceives the evolution of visual emotional cues while generating emotional captions through collaborative learning. Specifically, in the dynamic emotion perception path, we propose a dynamic emotion evolution module, which first aggregates visual features and historical caption features to summarize the global visual emotional cues, and then dynamically selects the emotional cues that need to be re-composed at each stage. In addition, in the adaptive caption generation path, we propose an emotion adaptive decoder to balance the description of factual content and emotional cues. Thus, our method can generate emotion-related words at the necessary time steps, and our caption generation balances the guidance of factual content and emotional cues well. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and the effectiveness of each proposed module.
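
The idea of letting emotion guide generation only at the time steps that need it can be illustrated with a small gating sketch. This is not the paper's emotion adaptive decoder, whose equations are not given here. It assumes a hypothetical scalar intensity gate (a sigmoid over a linear score of the decoder state) that scales an emotion cue vector before it is added to the hidden state. The names `emotion_intensity`, `gated_fusion`, `weights`, and `bias` are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def emotion_intensity(hidden, weights, bias=0.0):
    """Hypothetical intensity estimate: a scalar in (0, 1) from the decoder state."""
    score = sum(h * w for h, w in zip(hidden, weights)) + bias
    return sigmoid(score)


def gated_fusion(hidden, emotion_cue, weights, bias=0.0):
    """Add the emotion cue to the hidden state, scaled by the intensity gate.

    A gate near 0 leaves the factual hidden state almost untouched;
    a gate near 1 injects the full emotion cue. (Assumed form, for illustration.)
    """
    gate = emotion_intensity(hidden, weights, bias)
    fused = [h + gate * e for h, e in zip(hidden, emotion_cue)]
    return fused, gate


hidden = [0.5, -1.0, 2.0]
emotion_cue = [1.0, 1.0, 1.0]

# A strongly negative bias mimics a step where a factual word is being generated.
factual_step, low_gate = gated_fusion(hidden, emotion_cue, [0.0, 0.0, 0.0], bias=-10.0)

# A strongly positive bias mimics a step where an emotion-related word is needed.
emotional_step, high_gate = gated_fusion(hidden, emotion_cue, [0.0, 0.0, 0.0], bias=10.0)
```

In this toy setting the gate is driven only by the bias. In a trained model the gate parameters would be learned, so that the intensity rises at steps where emotion-related words are appropriate and stays low elsewhere.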

Paper Structure (20 sections, 19 equations, 5 figures, 6 tables)

Introduction
Related Work
Visual Emotional Analysis
Emotional Video Captioning
Proposed Method
Video and Emotion Feature Extraction
Dynamic Emotion Perception Path
Adaptive Caption Generation Path
Experiments
Datasets and Evaluation Metrics
Implementation Details
Main Results
Ablation Studies
Qualitative Results
Emotional Image Captioning
...and 5 more sections

Figures (5)

Figure 1: Motivation of our method, which combines dynamic emotion perception with adaptive caption generation so that it can generate accurate emotional captions.
Figure 2: Overview of our proposed dual-path collaborative generation network. It consists mainly of the dynamic emotion perception path and the adaptive caption generation path.
Figure 3: The framework of our dynamic emotion evolution.
Figure 4: The framework of our adaptive caption generation.
Figure 5: Qualitative results on the EVC-VE dataset. The figure shows different emotions at different stages of the videos. In our generated captions, typical emotion-related words are marked in green and emotion-irrelevant words in purple to demonstrate the effect of our model. Incorrect predictions are marked in red.
