Enhancing Multimodal Affective Analysis with Learned Live Comment Features

Zhaoyuan Deng, Amith Ananthram, Kathleen McKeown

TL;DR

This work uses contrastive learning to train a video encoder to produce synthetic live comment features for enhanced multimodal affective content analysis, and demonstrates that these synthetic live comment features significantly improve performance over state-of-the-art methods.

Abstract

Live comments, also known as Danmaku, are user-generated messages that are synchronized with video content. These comments overlay directly onto streaming videos, capturing viewer emotions and reactions in real time. While prior work has leveraged live comments in affective analysis, their use has been limited due to the relative rarity of live comments across different video platforms. To address this, we first construct the Live Comment for Affective Analysis (LCAffect) dataset, which contains live comments for English and Chinese videos spanning diverse genres that elicit a wide spectrum of emotions. Then, using this dataset, we apply contrastive learning to train a video encoder to produce synthetic live comment features for enhanced multimodal affective content analysis. Through comprehensive experimentation on a wide range of affective analysis tasks (sentiment, emotion recognition, and sarcasm detection) in both English and Chinese, we demonstrate that these synthetic live comment features significantly improve performance over state-of-the-art methods.

Paper Structure

This paper contains 24 sections, 4 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: An example video frame and transcript from the TV show The Big Bang Theory with accompanying live comments overlaid. Below each original Chinese comment, we include its English translation. By training a video encoder to produce representations similar to these live comments, we can produce multimodal features that adapt well to affective analysis tasks like emotion recognition.
  • Figure 2: Our contrastive pre-training approach. We train our V2LC encoder to predict the correct pairings of a batch of (video, live comment) training examples. Correct pairings of video segment embeddings $S$ and live comment embeddings $C$ are highlighted in blue. Comments $c_1^3$ and $c_3^2$ have high similarity, so both are treated as correct matches for Segments 1 and 3.
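
The pairing objective described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the paper's implementation: the loss form (a softmax over comments, averaged over all positives for each segment), the function name `multi_positive_contrastive_loss`, the `temperature` value, and the `sim_threshold` used to decide that two comments have "high similarity" are all assumptions made for illustration. The sketch also assumes that every segment has at least one comment in the batch.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def multi_positive_contrastive_loss(seg_emb, com_emb, com_owner,
                                    temperature=0.07, sim_threshold=0.9):
    """Hypothetical segment-to-comment contrastive loss with shared positives.

    seg_emb:   (N, d) video segment embeddings S
    com_emb:   (M, d) live comment embeddings C
    com_owner: (M,)   index of the segment each comment was posted on
    Returns the mean loss and the (N, M) boolean matrix of positive pairs.
    """
    S = l2_normalize(seg_emb)
    C = l2_normalize(com_emb)
    logits = S @ C.T / temperature                      # (N, M) scaled similarities

    # A comment is a positive for the segment it was posted on.
    own = com_owner[None, :] == np.arange(S.shape[0])[:, None]      # (N, M)

    # Assumption: comments whose cosine similarity exceeds the threshold are
    # interchangeable, so each is also a positive for the other's segment.
    near_dup = (C @ C.T) >= sim_threshold                           # (M, M)
    positives = (own.astype(float) @ near_dup.astype(float)) > 0    # (N, M)

    # Average negative log-likelihood over each segment's positives.
    log_p = log_softmax(logits, axis=1)
    per_segment = -(log_p * positives).sum(axis=1) / positives.sum(axis=1)
    return per_segment.mean(), positives


# Toy batch: 3 segments, 5 comments. The last comment (posted on segment 2)
# duplicates the first (posted on segment 0), so both segments share them.
rng = np.random.default_rng(0)
segments = rng.normal(size=(3, 8))
comments = rng.normal(size=(5, 8))
comments[4] = comments[0]
owner = np.array([0, 0, 1, 2, 2])

loss, positives = multi_positive_contrastive_loss(segments, comments, owner)
print(positives.astype(int))
print(float(loss))
```

In the toy batch, the duplicated comment pair plays the role of the two highly similar comments in the caption: each is marked as a correct match for both of the segments involved, instead of being pushed apart as a false negative.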