EmoCaliber: Advancing Reliable Visual Emotion Comprehension via Confidence Verbalization and Calibration

Daiqing Wu, Dongbao Yang, Can Ma, Yu Zhou

TL;DR

EmoCaliber addresses the unreliability of deterministic VEC outputs by enabling explicit confidence verbalization and calibration. It introduces a three-stage training framework and a VEC-CoT data pipeline to foster structured affective reasoning, along with a unified benchmark, VECBench, for fair in-domain (ID) and out-of-domain (OOD) evaluation. Empirical results show improvements in both emotion prediction accuracy and confidence calibration, supporting robustness across diverse data. This work provides a practical baseline and dataset pipeline to advance reliable visual emotion understanding in multimodal systems.

Abstract

Visual Emotion Comprehension (VEC) aims to infer sentiment polarities or emotion categories from affective cues embedded in images. In recent years, Multimodal Large Language Models (MLLMs) have established a popular paradigm in VEC, leveraging their generalizability to unify VEC tasks defined under diverse emotion taxonomies. While this paradigm achieves notable success, it typically formulates VEC as a deterministic task, requiring the model to output a single, definitive emotion label for each image. Such a formulation insufficiently accounts for the inherent subjectivity of emotion perception, overlooking alternative interpretations that may be equally plausible to different viewers. To address this limitation, we propose equipping MLLMs with the capability to verbalize their confidence in emotion predictions. This additional signal provides users with an estimate of both the plausibility of alternative interpretations and the MLLMs' self-assessed competence, thereby enhancing reliability in practice. Building on this insight, we introduce a three-stage training framework that progressively endows the model with structured reasoning, teaches it to verbalize confidence, and calibrates its confidence expression, culminating in EmoCaliber, a confidence-aware MLLM for VEC. Through fair and comprehensive evaluations on the unified benchmark VECBench, EmoCaliber demonstrates overall superiority over existing methods in both emotion prediction and confidence estimation. These results validate the effectiveness of our approach and mark a feasible step toward more reliable VEC systems. Project page: https://github.com/wdqqdw/EmoCaliber.
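
To make "calibrated confidence" concrete, the sketch below computes Expected Calibration Error (ECE), a widely used calibration metric. This page does not state which metric the paper reports, so ECE here is an illustrative assumption. The function name, the equal-width binning, and the default of 10 bins are also choices made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Standard ECE with equal-width bins over [0, 1].

    Assumption: confidences are already numeric values in [0, 1],
    e.g. parsed from the model's verbalized confidence.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hits = [0] * n_bins        # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, count):
        if k == 0:
            continue
        accuracy = h / k
        mean_conf = s / k
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (k / n) * abs(accuracy - mean_conf)
    return ece
```

A model that says 0.9 on four predictions and gets three right has an ECE of about 0.15; a lower value means stated confidence tracks actual accuracy more closely.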

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: In addition to a structured Chain-of-Thought (CoT) and the derived answer, EmoCaliber also produces a self-evaluated confidence level. This confidence level lets users adopt the output selectively, significantly enhancing the reliability of the VEC system.
  • Figure 2: Illustration of the model’s evolution across the three training stages. Through these stages, the model is successively endowed with structured reasoning, taught to verbalize confidence, and finally calibrated to express confidence accurately.
  • Figure 3: Task composition of VECBench. The training split comprises Visual Sentiment Analysis (VSA) and Visual Emotion Recognition (VER) tasks, with each subtask denoted as "source-granularity (#sample)". In addition to retaining corresponding subtasks for in-domain (ID) evaluation, the test split also includes out-of-domain (OOD) VER tasks to verify generalization ability.
  • Figure 4: Construction pipeline of the VEC-CoT dataset from the training split of VECBench. Image–label pairs are first templatized and fed into proprietary MLLMs to synthesize structured CoTs, which are then subjected to strict quality evaluation. Finally, image–text pairs with high-quality CoTs are retained and grouped into the VEC-CoT dataset.
  • Figure 5: Illustration of the second training stage. The scaffold model first performs inference on unseen data, producing a CoT and an emotion prediction. This prediction is then mapped onto a VAD lexicon-based emotion loop to measure its normalized distance from the ground-truth label. The estimated confidence score, derived from this distance, is directly appended to the original CoT and prediction, forming the supervision data used for SFT.
  • ...and 2 more figures
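
The confidence-labeling step described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ordering in EMOTION_LOOP is a placeholder rather than one derived from a VAD lexicon, the linear mapping from distance to confidence is assumed, and the output text format and all function names are hypothetical.

```python
from typing import Sequence

# Placeholder circular ordering of emotion labels (assumption for illustration;
# the paper derives its loop from a VAD lexicon, which is not reproduced here).
EMOTION_LOOP = [
    "amusement", "excitement", "awe", "contentment",
    "sadness", "fear", "disgust", "anger",
]


def loop_distance(pred: str, truth: str, loop: Sequence[str] = EMOTION_LOOP) -> float:
    """Shortest distance between two labels around the loop, normalized to [0, 1]."""
    i, j = loop.index(pred), loop.index(truth)
    steps = abs(i - j)
    steps = min(steps, len(loop) - steps)  # go around the shorter way
    return steps / (len(loop) // 2)        # the farthest point is half the loop away


def confidence_from_distance(distance: float) -> float:
    """Assumed mapping: closer predictions receive higher confidence."""
    return 1.0 - distance


def build_supervision(cot: str, pred: str, truth: str) -> str:
    """Append the estimated confidence to the CoT and prediction (hypothetical format)."""
    conf = confidence_from_distance(loop_distance(pred, truth))
    return f"{cot}\nAnswer: {pred}\nConfidence: {conf:.2f}"
```

With this placeholder loop, a prediction adjacent to the ground-truth label receives a confidence of 0.75, while a prediction on the opposite side of the loop receives 0.0.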