Assessment of Multimodal Large Language Models in Alignment with Human Values

Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

TL;DR

This work tackles the challenge of aligning multimodal large language models with human values by introducing the Ch3Ef dataset, the first comprehensive $A3$ benchmark for helpful, honest, and harmless evaluation, along with a unified evaluation strategy that operates across $A1$-$A3$. The dataset comprises 1002 human-annotated QA samples across 12 domains and 46 tasks, created via Human-Machine Synergy and evaluated with a modular framework (Instruction-Inferencer-Metric) and multiple Recipe paradigms. The authors report over 10 findings detailing model strengths, weaknesses, and the relationships between perception, reasoning, and value-aligned behavior, highlighting trade-offs between safety and engagement and the particular challenges faced by open-source MLLMs. The work provides a concrete path toward robust, real-world alignment of MLLMs and offers a reusable benchmark and evaluation framework to guide future development and benchmarking in this area.

Abstract

Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh). For Multimodal Large Language Models (MLLMs), however, alignment with human values remains largely unexplored despite their commendable performance in perception and reasoning tasks, because hhh dimensions are complex to define in the visual world and it is difficult to collect data that accurately mirrors real-world situations. To address this gap, we introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations. The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the hhh principle. We also present a unified evaluation strategy supporting assessment across various scenarios and from different perspectives. Based on the evaluation results, we summarize over 10 key findings that deepen the understanding of MLLM capabilities, limitations, and the dynamic relationships between evaluation levels, guiding future advancements in the field.

Paper Structure (48 sections, 20 figures, 9 tables)

Figures (20)

  • Figure 1: Overview of Evaluation for MLLMs. The evaluation for MLLMs is categorized into three ascending levels of alignment. The examples for each alignment level are displayed in the upper half. The benchmarks and evaluated dimensions are illustrated at each level. C$h^3$Ef dataset is the first comprehensive A3 dataset on hhh (helpful, honest, harmless) criteria, and the evaluation strategy can be used to evaluate MLLMs on various scenarios across A1-A3 spectra.
  • Figure 2: C$h^3$Ef dataset's taxonomy and statistics. (a) The taxonomy, which emphasizes the hhh criteria, systematically outlines 4/3/5 domains and 22/7/17 tasks for helpful/honest/harmless, respectively. (b) Details of the domains and tasks.
  • Figure 3: Data Samples in C$h^3$Ef Dataset. Each sample comprises one or more images, accompanied by a carefully human-annotated question and several options. The correct option is indicated in bold.
  • Figure 4: Creation Process of C$h^3$Ef Dataset. It includes image collection from existing datasets and generation models, along with question and option annotation using Human-Machine Synergy.
  • Figure 5: Overview of C$h^3$Ef Evaluation Strategy. It comprises three compatible modules, i.e., Instruction, Inferencer, and Metric. A Recipe is a specific selection of each module, and different Recipes enable evaluation from different perspectives across scenarios spanning the A1-A3 spectrum. The right side shows different Recipes for evaluating different dimensions, including location (Locat.), QA performance (QA Perf.), in-context learning performance (ICL Perf.), calibration (Calib.), and alignment with human values (Human-value).
  • ...and 15 more figures
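
The Figure 5 caption describes a Recipe as a specific selection of an Instruction, an Inferencer, and a Metric. The sketch below illustrates that composition idea only; it is not the authors' implementation. Every name in it (`Recipe`, `multiple_choice_instruction`, `direct_inferencer`, `accuracy`) is hypothetical, the model is assumed to be a plain callable from prompt text to answer text, and images are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical type aliases for illustration; not the paper's API.
Model = Callable[[str], str]                      # prompt text -> raw answer text
Instruction = Callable[[str, Sequence[str]], str]  # builds the prompt
Inferencer = Callable[[Model, str, Sequence[str]], str]  # queries the model
Metric = Callable[[Sequence[str], Sequence[str]], float]  # scores predictions
Sample = Tuple[str, Sequence[str], str]           # (question, options, answer letter)


@dataclass(frozen=True)
class Recipe:
    """One specific selection of the three modules."""
    instruction: Instruction
    inferencer: Inferencer
    metric: Metric

    def evaluate(self, model: Model, samples: Sequence[Sample]) -> float:
        predictions, answers = [], []
        for question, options, answer in samples:
            prompt = self.instruction(question, options)
            predictions.append(self.inferencer(model, prompt, options))
            answers.append(answer)
        return self.metric(predictions, answers)


def multiple_choice_instruction(question: str, options: Sequence[str]) -> str:
    """Format a question and its options as a lettered multiple-choice prompt."""
    letters = "ABCDEFGH"
    lines = [question] + [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def direct_inferencer(model: Model, prompt: str, options: Sequence[str]) -> str:
    """Ask the model once and take its stripped output as the prediction."""
    return model(prompt).strip()


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if not answers:
        return 0.0
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```

Under these assumptions, swapping any one module (for example, a different metric for calibration) yields a different Recipe without touching the other two, which is the modularity the caption points to.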