Attributes-aware Visual Emotion Representation Learning

Rahul Singh Maharjan, Marta Romeo, Angelo Cangelosi

TL;DR

This work tackles visual emotion analysis by addressing the affective gap through attribute-aware learning. It introduces A4Net, a multi-branch architecture built on a ConvNeXt-V2 backbone that jointly learns four attributes—brightness, colorfulness, scene, and facial expressions—and fuses their features for emotion classification. Empirical results on EmoSet, EMOTIC, SE30K8, and UnBiasEmo demonstrate improved performance over traditional CNNs and prior attribute-aware methods, with GradCAM visualizations offering interpretability. The approach highlights the value of explicitly modeling perceptual attributes to enhance generalization across diverse visual emotion datasets and provides a foundation for integrating additional cues in future work.

Abstract

Visual emotion analysis, or visual emotion recognition, has gained considerable attention due to the growing interest in understanding how images convey rich semantics and evoke emotions in human perception. However, visual emotion analysis poses distinctive challenges compared to traditional vision tasks, especially because of the intricate relationship between general visual features and the different affective states they evoke, known as the affective gap. Researchers have addressed this challenge with deep representation learning methods that extract generalized features from entire images. However, most existing methods overlook the importance of specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. In this paper, we introduce A4Net, a deep representation network that bridges the affective gap by leveraging four key attributes: brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4). By fusing and jointly training all aspects of attribute recognition and visual emotion analysis, A4Net aims to provide better insight into the emotional content of images. Experimental results show the effectiveness of A4Net, which achieves competitive performance compared to state-of-the-art methods across diverse visual emotion datasets. Furthermore, visualizations of activation maps generated by A4Net offer insights into its ability to generalize across different visual emotion datasets.
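The abstract names brightness and colorfulness as two of the four attributes but does not say here how they are measured. As a rough illustration only, the sketch below computes two common hand-crafted proxies: mean Rec. 601 luma for brightness and the Hasler-Süsstrunk metric for colorfulness. Both choices, and the function names `brightness` and `colorfulness`, are assumptions made for this example and may differ from what A4Net actually uses as attribute targets.

```python
import math
from statistics import fmean, pstdev


def brightness(pixels):
    """Mean Rec. 601 luma of 8-bit RGB pixels, scaled to [0, 1].

    Assumption: this is one common brightness proxy, not necessarily
    the target used to train the A4Net brightness branch.
    """
    return fmean(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / 255.0


def colorfulness(pixels):
    """Hasler-Suesstrunk colorfulness metric on 8-bit RGB pixels.

    Assumption: a standard hand-crafted metric, used here only to
    illustrate what a colorfulness target could look like.
    """
    rg = [r - g for r, g, b in pixels]               # red-green opponent channel
    yb = [0.5 * (r + g) - b for r, g, b in pixels]   # yellow-blue opponent channel
    spread = math.hypot(pstdev(rg), pstdev(yb))      # combined standard deviation
    offset = math.hypot(fmean(rg), fmean(yb))        # combined mean magnitude
    return spread + 0.3 * offset


# Toy "images" given as flat lists of (R, G, B) tuples.
gray = [(128, 128, 128)] * 4
vivid = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]

for name, image in (("gray", gray), ("vivid", vivid)):
    print(name, round(brightness(image), 3), round(colorfulness(image), 3))
```

A uniform gray patch has zero colorfulness under this metric, while the saturated patch scores much higher, which is the kind of low-level signal an attribute branch could be asked to predict.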

Paper Structure

This paper contains 20 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: A4Net consists of one backbone network and four attribute branches. The color branch estimates color intensity, and the brightness branch estimates brightness. The scene and facial expression branches classify the image into scene and facial expression classes, respectively. The feature vectors from the four branches are then fused to classify visual emotion.
  • Figure 2: GradCAM visualization of A4Net trained on the EmoNet Dataset. Words highlighted in blue indicate correct classifications. Words highlighted in red indicate cases where A4Net predicts the wrong class. Words highlighted in green represent classes not present in the test dataset. (Best viewed in Color)
  • Figure 3: GradCAM visualization showcasing performance of A4Net on the SE30K8 Dataset. Words highlighted in blue denote correct classifications. Instances where A4Net identifies classes not present in the test dataset are highlighted in green. (Best viewed in Color)
  • Figure 4: GradCAM visualization of A4Net trained on the UnBiasEmo Dataset. Words highlighted in blue indicate that A4Net predicts the true class correctly. Words highlighted in green indicate classes not present in the test dataset. (Best viewed in Color)
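
The branch-and-fuse layout described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the ConvNeXt-V2 backbone is replaced by a precomputed feature vector, fusion is assumed to be simple concatenation, the joint loss is assumed to be an unweighted sum of cross-entropy and squared-error terms, and every name and dimension (`MultiBranchSketch`, `joint_loss`, `feat_dim`, `n_emotion=8`, and so on) is hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_linear(rng, n_in, n_out):
    """Small random weights and zero bias for a dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MultiBranchSketch:
    """Hypothetical stand-in for a backbone followed by four attribute branches."""

    BRANCHES = ("brightness", "colorfulness", "scene", "face")

    def __init__(self, feat_dim=8, branch_dim=4, n_scene=3, n_face=3, n_emotion=8, seed=0):
        rng = random.Random(seed)
        # One projection per attribute branch on top of shared backbone features.
        self.proj = {b: init_linear(rng, feat_dim, branch_dim) for b in self.BRANCHES}
        # Scalar regression heads for brightness/colorfulness, logits for scene/face.
        out_dims = {"brightness": 1, "colorfulness": 1, "scene": n_scene, "face": n_face}
        self.heads = {b: init_linear(rng, branch_dim, out_dims[b]) for b in self.BRANCHES}
        # Emotion classifier on the fused branch features (assumed: concatenation).
        self.emotion = init_linear(rng, branch_dim * len(self.BRANCHES), n_emotion)

    def forward(self, backbone_features):
        branch_feats, attr_out = {}, {}
        for b in self.BRANCHES:
            hidden = [max(0.0, h) for h in linear(backbone_features, *self.proj[b])]  # ReLU
            branch_feats[b] = hidden
            attr_out[b] = linear(hidden, *self.heads[b])
        fused = [v for b in self.BRANCHES for v in branch_feats[b]]
        emotion_probs = softmax(linear(fused, *self.emotion))
        return attr_out, fused, emotion_probs


def joint_loss(attr_out, emotion_probs, targets):
    """Assumed joint objective: unweighted sum of all task losses."""
    def cross_entropy(probs, index):
        return -math.log(probs[index])

    loss = cross_entropy(emotion_probs, targets["emotion"])
    loss += cross_entropy(softmax(attr_out["scene"]), targets["scene"])
    loss += cross_entropy(softmax(attr_out["face"]), targets["face"])
    loss += (attr_out["brightness"][0] - targets["brightness"]) ** 2
    loss += (attr_out["colorfulness"][0] - targets["colorfulness"]) ** 2
    return loss


model = MultiBranchSketch()
features = [0.5] * 8  # placeholder for backbone output on one image
attr_out, fused, emotion_probs = model.forward(features)
targets = {"emotion": 2, "scene": 1, "face": 0, "brightness": 0.6, "colorfulness": 0.4}
print(len(fused), len(emotion_probs), joint_loss(attr_out, emotion_probs, targets))
```

Because every branch reads the same backbone features and the emotion classifier reads the fused branch features, gradients from the attribute losses and the emotion loss would all reach the shared representation, which is the intuition behind training the tasks jointly.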