AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Yuhang Wu, Wenmeng Yu, Yean Cheng, Yan Wang, Xiaohan Zhang, Jiazheng Xu, Ming Ding, Yuxiao Dong

TL;DR

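This paper addresses the gap in evaluating the alignment of Chinese vision-language models. It proposes AlignMMBench, a dataset covering 13 tasks in 3 categories with both single-turn and multi-turn dialogues, comprising 1,054 images and 4,978 question-answer pairs, together with the CritiqueVLM evaluation framework for controllable, robust alignment evaluation. A prompt rewrite strategy improves evaluation stability, and an alignment score quantifies the consistency of results across sets of synonymous questions; models are grouped into four classes to reveal the relationship between alignment capability and overall performance. Empirical results show that Chinese training data is crucial for alignment evaluation, that GPT-4o leads on most tasks, and that CritiqueVLM agrees with human judgments better than GPT-4 despite having fewer parameters. The work provides a publicly accessible Chinese alignment benchmark and evaluation tool to promote research on and comparison of the alignment capabilities of Chinese VLMs.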

Abstract

Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, which provides more nuanced evaluations of alignment capabilities and is the first benchmark specifically designed for Chinese visual contexts. This benchmark is meticulously curated from real-world scenarios and internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we develop CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Additionally, we measure the "alignment score", a quantitative metric designed to assess the robustness and stability of models across diverse prompts. Finally, we evaluate the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. The evaluation code and data are available at https://github.com/THUDM/AlignMMBench.
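
The abstract describes the alignment score only as a measure of stability across diverse prompts and does not give its formula. As a rough illustration, the sketch below assumes the score is the reciprocal of the mean population standard deviation of the 0 to 10 scores a model receives on rewritten variants of the same question. This assumption is consistent with the range of 0.2 to $\infty$ reported for Figure 1 (the largest possible standard deviation on a 0 to 10 scale is 5), but it may not match the authors' exact definition; the function name and input layout are hypothetical.

```python
import math
from statistics import mean, pstdev


def alignment_score(scores_per_question):
    """Hypothetical stability metric (assumed form, not the authors' code).

    scores_per_question: list of lists; each inner list holds the 0-10
    scores a model received on the rewritten variants of one question.
    """
    # Spread of scores within each group of synonymous prompts.
    spreads = [pstdev(scores) for scores in scores_per_question]
    avg_spread = mean(spreads)
    # Perfectly consistent scores give zero spread, i.e. an unbounded score.
    return math.inf if avg_spread == 0 else 1.0 / avg_spread


# Two questions, each asked with two rewritten prompts.
example = [[6, 8], [5, 5]]
print(alignment_score(example))
```

Under this assumed definition, a model whose scores swing between 0 and 10 on every question sits at the lower bound of 0.2, while a model that scores identically on every rewrite has an infinite score.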

Paper Structure

This paper contains 44 sections, 1 equation, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Performance vs. alignment score of various models. Performance scores range from 0 to 10, while alignment scores range from 0.2 to $\infty$.
  • Figure 2: Categories and examples of AlignMMBench. The chart on the left displays the categories of AlignMMBench, encompassing three main categories and thirteen specific tasks. The numbers listed under each category represent the number of images in that category and the corresponding percentage of the total. The right side of the pie chart presents two examples, illustrating instances from the incoherence and coherence tasks.
  • Figure 3: Overall framework of our work.
  • Figure 4:
  • Figure 5: Radar chart of leaderboard results.
  • ...and 7 more figures