M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs

Bei Yan, Jie Zhang, Zhiyuan Chen, Shiguang Shan, Xilin Chen

TL;DR

M³oralBench addresses the gap in evaluating morality for LVLMs by building a multimodal benchmark grounded in Moral Foundations Theory. It expands Moral Foundations Vignettes with GPT-4o-generated scenarios and SD3.0-generated images, adding dialogues to enrich context, and defines three tasks—moral judgement, moral classification, and moral response—across six moral foundations. The study benchmarks 10 LVLMs (open and closed) using Monte Carlo-based option likelihoods, revealing that closed-source models generally outperform open-source ones, with moral classification being the most challenging task. The benchmark provides a practical, multimodal framework for assessing and guiding alignment of LVLMs with human values, while highlighting areas (notably Loyalty/Betrayal and Sanctity) where current models struggle and improvements are needed.

Abstract

Recently, large foundation models, including large language models (LLMs) and large vision-language models (LVLMs), have become essential tools in critical fields such as law, finance, and healthcare. As these models increasingly integrate into our daily life, it is necessary to conduct moral evaluation to ensure that their outputs align with human values and remain within moral boundaries. Previous works primarily focus on LLMs, proposing moral datasets and benchmarks limited to text modality. However, given the rapid development of LVLMs, there is still a lack of multimodal moral evaluation methods. To bridge this gap, we introduce M$^3$oralBench, the first MultiModal Moral Benchmark for LVLMs. M$^3$oralBench expands the everyday moral scenarios in Moral Foundations Vignettes (MFVs) and employs the text-to-image diffusion model, SD3.0, to create corresponding scenario images. It conducts moral evaluation across six moral foundations of Moral Foundations Theory (MFT) and encompasses tasks in moral judgement, moral classification, and moral response, providing a comprehensive assessment of model performance in multimodal moral understanding and reasoning. Extensive experiments on 10 popular open-source and closed-source LVLMs demonstrate that M$^3$oralBench is a challenging benchmark, exposing notable moral limitations in current models. Our benchmark is publicly available.
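The TL;DR mentions that models are scored with Monte Carlo-based option likelihoods, but this summary does not spell out the procedure. One plausible reading, offered only as a sketch, is that each multiple-choice question is posed several times with the options shuffled, and the fraction of times each option is picked serves as its estimated likelihood. The function name `estimate_option_likelihoods`, the `choose` callback, the sample count, and the toy model below are all illustrative assumptions, not the authors' implementation.

```python
import random
from typing import Callable, Dict, List


def estimate_option_likelihoods(
    choose: Callable[[List[str]], str],
    options: List[str],
    num_samples: int = 20,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate how likely a model is to pick each option.

    `choose` is a hypothetical stand-in for an LVLM call: it receives the
    options in the order shown to the model and returns the chosen option text.
    """
    rng = random.Random(seed)
    counts = {option: 0 for option in options}
    for _ in range(num_samples):
        # Shuffle a copy so the estimate is less sensitive to option position.
        shuffled = list(options)
        rng.shuffle(shuffled)
        picked = choose(shuffled)
        # Answers that match no option are ignored, so the likelihoods
        # can sum to less than 1 when the model gives invalid replies.
        if picked in counts:
            counts[picked] += 1
    return {option: counts[option] / num_samples for option in options}


def always_wrong(shuffled_options: List[str]) -> str:
    """Toy model (assumption): always judges the behavior as morally wrong."""
    return "Morally wrong"


likelihoods = estimate_option_likelihoods(
    always_wrong, ["Morally wrong", "Not morally wrong"]
)
```

With a real model, `choose` would wrap the image, instruction, and shuffled options into a single query; the shuffling step is a design choice meant to average out position bias rather than something this summary confirms.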

Paper Structure (23 sections, 1 equation, 12 figures, 7 tables)

Figures (12)

  • Figure 1: An overview of the entire pipeline for M³oralBench construction. We use GPT-4o to expand the Moral Foundations Vignettes, creating a set of moral violation scenarios. For image generation, we further employ GPT-4o to transform these scenarios into image generation prompts with details on location and character, as well as main character dialogues. The moral scenario images are then generated by SD3.0, with dialogues incorporated by speech bubbles. For instruction generation, we apply an instruction template gallery to produce task-specific instructions and reference answers.
  • Figure 2: Examples of moral scenarios in MFVs that violate different moral foundations.
  • Figure 3: An example of scenario expansion from the MFVs. Character and location details in the scenarios are underlined.
  • Figure 4: An example of an image generation prompt and the main character's dialogue. The main character in the scenario is underlined.
  • Figure 5: Examples of M³oralBench evaluation for different moral tasks. Moral judgement requires the model to assess whether the behavior depicted in the top-left image is morally wrong. Moral classification requires the model to identify the specific moral foundation violated in the top-left image. Moral response requires the model to choose the appropriate response in the context of the bottom-left images.
  • ...and 7 more figures