MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Shezheng Song, Chengxiang He, Shan Zhao, Chengyu Wang, Qian Wan, Tianwei Yan, Meng Wang

TL;DR

This work introduces MOSABench, a benchmark engineered to evaluate multimodal large language models on multi-object sentiment analysis within complex images. It combines distance-aware object annotation, standardized post-processing of LLM outputs, and a specialized multi-object scoring scheme to assess how well models infer sentiments for multiple targets in a single image. Comprehensive experiments across open- and closed-source MLLMs reveal that most models struggle with multi-object sentiment and that performance degrades as spatial distance between targets increases, with notable successes from models like mPLUG-Owl, Qwen-VL2-7B, and ERNIE Bot. The dataset, analyses (including attention visualizations and confusion matrices), and scoring framework establish MOSABench as a foundational tool to drive targeted improvements in perception, reasoning, and instruction design for complex multimodal sentiment understanding.

Abstract

Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. Despite these advances, there is still no standardized benchmark for evaluating MLLM performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing that standardizes model outputs for evaluation, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-Owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to improve accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.

Paper Structure

This paper contains 17 sections, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: F1 comparison on MOSABench across multimodal large language models
  • Figure 2: Example from a previous dataset, illustrating single-object data with a lack of instruction adaptation for MLLM.
  • Figure 3: Example from our MOSABench.
  • Figure 4: Data example of our MOSABench. "Instruction" specifies the task that the LLM needs to perform, while "Answer" represents the expected result of the task execution. "Objects" indicates the targets present in the image, requiring the LLM to complete the task in the "Instruction" by integrating the "Text" and "Image".
  • Figure 5: Distance label examples in MOSABench. "Interlap" indicates that the bounding boxes of the individuals overlap. "Close" denotes that the boxes do not overlap but the distance between them is less than $L/k$, where $L$ is the image length and $k$ is a hyperparameter. "Far" signifies that the distance between the boxes exceeds $L/k$.
  • ...and 8 more figures
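
The distance labels described in the Figure 5 caption ("Interlap", "Close", "Far") can be sketched roughly as follows. This is a minimal illustration, not the authors' annotation code: the box format `(x_min, y_min, x_max, y_max)`, the edge-to-edge definition of "distance between the boxes", the function names, and the default `k = 4` are all assumptions made for the example. The caption also does not say how a distance exactly equal to $L/k$ is labeled; the sketch treats it as "Far".

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def box_gap(a: Box, b: Box) -> float:
    """Edge-to-edge Euclidean gap between two boxes (0 if they touch or overlap).

    Assumption: the source does not define how box distance is measured.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5


def distance_label(a: Box, b: Box, image_length: float, k: float = 4.0) -> str:
    """Assign a distance label using the threshold L / k from the caption.

    image_length is L; k is the hyperparameter (4.0 is a hypothetical value).
    """
    if boxes_overlap(a, b):
        return "Interlap"
    if box_gap(a, b) < image_length / k:
        return "Close"
    return "Far"


if __name__ == "__main__":
    person_a = (0.0, 0.0, 10.0, 10.0)
    print(distance_label(person_a, (5.0, 5.0, 15.0, 15.0), image_length=100.0))
    print(distance_label(person_a, (20.0, 0.0, 30.0, 10.0), image_length=100.0))
    print(distance_label(person_a, (60.0, 0.0, 70.0, 10.0), image_length=100.0))
```

A larger `k` shrinks the threshold $L/k$, so more non-overlapping pairs fall into "Far"; the value actually used in the benchmark should be taken from the paper.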