Omni-SILA: Towards Omni-scene Driven Visual Sentiment Identifying, Locating and Attributing in Videos

Jiamin Luo, Jingjing Wang, Junxiao Ma, Yujie Jin, Shoushan Li, Guodong Zhou

TL;DR

The paper defines Omni-SILA, a task that identifies, locates, and attributes visual sentiments in videos by leveraging both explicit and implicit scene information. It introduces the Implicit-enhanced Causal MoE (ICM), comprising a Scene-Balanced MoE and an Implicit-Enhanced Causal block, combined with a two-stage training pipeline of scene-tuning followed by Omni-SILA tuning. Empirical results on explicit and implicit Omni-SILA datasets show that ICM outperforms state-of-the-art Video-LLMs, especially in identifying and locating implicit sentiments and providing plausible attributions, while maintaining competitive efficiency. The work demonstrates the value of integrating omni-scene cues and causal interventions for robust, interpretable visual sentiment understanding, with practical implications for content safety and moderation.

Abstract

Prior studies on Visual Sentiment Understanding (VSU) primarily rely on explicit scene information (e.g., facial expression) to judge visual sentiments, largely ignoring implicit scene information (e.g., human action, object relation and visual background), even though such information is critical for precisely discovering visual sentiments. Motivated by this, this paper proposes a new Omni-scene driven visual Sentiment Identifying, Locating and Attributing in videos (Omni-SILA) task, aiming to interactively and precisely identify, locate and attribute visual sentiments through both explicit and implicit scene information. Furthermore, this paper argues that the Omni-SILA task faces two key challenges: modeling scene information and highlighting implicit scene information beyond explicit. To this end, this paper proposes an Implicit-enhanced Causal MoE (ICM) approach for addressing the Omni-SILA task. Specifically, a Scene-Balanced MoE (SBM) block and an Implicit-Enhanced Causal (IEC) block are tailored to model scene information and to highlight the implicit scene information beyond explicit, respectively. Extensive experimental results on our constructed explicit and implicit Omni-SILA datasets demonstrate the great advantage of the proposed ICM approach over advanced Video-LLMs.

Paper Structure

This paper contains 20 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The overall architecture of our ICM approach, consisting of a Scene-Enriched Modeling (SEM) block and an Implicit-enhanced Causal MoE framework, which comprises a Scene-Balanced MoE (SBM) block (right, see the SBM section) and an Implicit-Enhanced Causal (IEC) block (left, see the IEC section), where (a) and (b) are causal graphs for the IEC block. FEE, HAE, ORE and VBE represent Facial Expression Expert, Human Action Expert, Object Relation Expert and Visual Background Expert.
  • Figure 2: Two line charts comparing several well-performing Video-LLMs with our ICM approach on 11 implicit visual sentiments under two metrics, FNRs (a) and Atr-R (b). The red boxes mark the categories where the performance difference is biggest: Vandalism for FNRs and Fire for Atr-R.
  • Figure 3: Two statistical charts illustrating the efficiency of our ICM approach. The histogram (a) compares the inference time of ICM with baselines, while the line chart (b) shows the convergence of training losses of ICM, two well-performing Video-LLMs and the variants of ICM across training steps.
  • Figure 4: Two samples comparing ICM with other baselines.
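
As a rough illustration of the mixture-of-experts idea behind the four experts named in Figure 1, the sketch below shows generic softmax gating over expert outputs. It is not the paper's SBM block: the function names (`softmax`, `mix_experts`), the toy vectors, and the gating scheme are all assumptions made for illustration, and the paper's scene balancing and causal intervention are not modeled here.

```python
import math

# Expert names taken from the Figure 1 caption; everything else is hypothetical.
EXPERTS = ["FEE", "HAE", "ORE", "VBE"]


def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(gate_scores, expert_outputs):
    """Return (weights, mixed), where mixed is the gate-weighted sum
    of the expert output vectors."""
    if len(gate_scores) != len(expert_outputs):
        raise ValueError("need one gate score per expert")
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, mixed


# Toy example: equal gate scores give every expert the same weight.
gate_scores = [0.0, 0.0, 0.0, 0.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weights, mixed = mix_experts(gate_scores, expert_outputs)
for name, w in zip(EXPERTS, weights):
    print(f"{name}: weight={w:.2f}")
print("mixed output:", mixed)
```

With equal gate scores, each expert contributes a quarter of the mixed vector. Raising one score shifts weight toward that expert, which is the basic mechanism by which a gate could emphasize, for example, the visual background expert over the facial expression expert.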