Analyzing Image Beyond Visual Aspect: Image Emotion Classification via Multiple-Affective Captioning

Zibo Zhou, Zhengjun Zhai, Huimin Chen, Wei Dai, Hansen Yang

TL;DR

This work tackles image emotion classification (IEC) by bridging the visual-to-emotion gap with textual intermediate semantics. It introduces ACIEC, which jointly learns hierarchical emotional concepts, expressed as adjective-noun pairs (ANPs), and affective sentences generated with Emotional Attribute Chain-of-Thought (EA-CoT) prompts. Training is enhanced by a semantic-similarity–based contrastive loss and a hierarchical sampling strategy. An OCR-assisted module handles images with embedded text, and a RoBERTa-based fusion of ANP and DES produces the final emotion prediction. Across four public benchmarks, ACIEC achieves state-of-the-art accuracy, which supports the use of language-informed intermediate representations for IEC and highlights the value of structured reasoning and text-aware features in emotion understanding.

Abstract

Image emotion classification (IEC) is a longstanding research field that has received increasing attention with the rapid progress of deep learning. Although recent advances have leveraged the knowledge encoded in pre-trained visual models, their effectiveness is constrained by the "affective gap", which limits the applicability of pre-training knowledge to IEC tasks. Research in psychology has demonstrated that language exhibits high variability, encompasses diverse and abundant information, and can effectively eliminate the "affective gap". Inspired by this, we propose a novel method, Affective Captioning for Image Emotion Classification (ACIEC), which classifies image emotion from pure text that effectively captures the affective information in the image. In our method, a hierarchical multi-level contrastive loss is designed for detecting emotional concepts from images, while emotional attribute chain-of-thought reasoning is proposed to generate affective sentences. A pre-trained language model is then leveraged to synthesize the emotional concepts and affective sentences to conduct IEC. Additionally, a contrastive loss based on semantic similarity sampling is designed to address the large intra-class differences and small inter-class differences in affective datasets. We also take images with embedded text into consideration, which previous studies ignored. Extensive experiments illustrate that our method can effectively bridge the affective gap and achieve superior results on multiple benchmarks.
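
To make the contrastive-learning idea in the abstract concrete, the following is a minimal sketch of a generic supervised contrastive loss in plain Python. It is not the authors' formulation: the paper's hierarchical multi-level loss and its semantic-similarity sampling are not specified in this summary. The function names, the temperature value, and the toy embeddings are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not ACIEC's loss).

    For each anchor, samples with the same label are positives and all
    other samples appear in the softmax denominator. Anchors that have
    no positive in the batch are skipped.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other sample.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability over the anchor's positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical 2-D embeddings: two tight clusters.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(emb, [0, 0, 1, 1])     # labels match clusters
misaligned = supervised_contrastive_loss(emb, [0, 1, 0, 1])  # labels cut across clusters
```

In this toy setup the loss is small when same-label samples sit close together and large when labels cut across the clusters, which is the behavior a contrastive objective relies on to reduce intra-class spread and increase inter-class separation.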

Paper Structure

This paper contains 30 sections, 9 equations, 7 figures, 4 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Overview of the proposed ACIEC.
  • Figure 2: Example of hierarchical structure of VSO dataset.
  • Figure 3: A template of an EA-CoT prompt.
  • Figure 4: A template of an EA-CoT prompt.
  • Figure 5: Confusion matrices for classification results from ACIEC applied to each dataset. Figs. (a) and (b) display results using cross-entropy loss and our proposed loss function with the FI dataset, respectively. Similarly, Figs. (c) and (d) illustrate the corresponding results for the EmotionROI dataset.
  • ...and 2 more figures