HERM: Benchmarking and Enhancing Multimodal LLMs for Human-Centric Understanding

Keliang Li, Zaifei Yang, Jiahe Zhao, Hongze Shen, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

TL;DR

This paper introduces HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of Multimodal Large Language Models (MLLMs); HERM-100K, a comprehensive dataset with multi-level human-centric annotations for MLLM training; and HERM-7B, an MLLM trained on enhanced data derived from HERM-100K.

Abstract

The significant advancements of Multimodal Large Language Models (MLLMs) in visual understanding and instruction following have opened up broader applications in diverse and universal human-centric scenarios. However, existing image-text data may not support the precise modality alignment and integration of multi-grained information that human-centric visual understanding requires. In this paper, we introduce HERM-Bench, a benchmark for evaluating the human-centric understanding capabilities of MLLMs. Our work reveals the limitations of existing MLLMs in understanding complex human-centric scenarios. To address these challenges, we present HERM-100K, a comprehensive dataset with multi-level human-centric annotations, aimed at enhancing MLLM training. Furthermore, we develop HERM-7B, an MLLM that leverages enhanced training data from HERM-100K. Evaluations on HERM-Bench demonstrate that HERM-7B significantly outperforms existing MLLMs across various human-centric dimensions, reflecting the current inadequacy of the data annotations used in MLLM training for human-centric visual understanding. This research emphasizes the importance of specialized datasets and benchmarks in advancing MLLMs' capabilities for human-centric understanding.

Paper Structure

This paper contains 38 sections, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Overview of HERM. (1) We construct HERM-Bench, the first human-centric multi-modal benchmark. (2) We propose HERM-100K with multi-level human annotations. (3) We develop HERM-7B, an MLLM achieving state-of-the-art performance on human-centric basic perception and complex understanding.
  • Figure 2: Human-related information distribution in COCO captions. (a)/(b): heatmaps of the average number of characters/words used to describe various aspects (appearance, action, etc.), grouped by the number of people in each image (ranging from 1-3 to 13+). Descriptions of every aspect in COCO are limited to a few words and become increasingly inadequate as the number of people in the scene grows.
  • Figure 3: Taxonomy and examples of HERM-Bench. HERM-Bench includes 8 evaluation dimensions across the basic perception and complex understanding fields. The number in brackets denotes the number of questions in each evaluation dimension.
  • Figure 4: The overall pipeline for constructing HERM-100K, HERM-Bench, and the training data. First, we derive HERM-100K using powerful off-the-shelf foundation models. Then, using the visual annotations in HERM-100K, we create multitask training and instruction tuning data, and prompt GPT-4 to build HERM-Bench.
  • Figure 5: Evaluation examples on HERM-Bench. We compare the outputs of LLaVA, MiniGPT-v2, and HERM-7B. Erroneous parts are marked in red and correct parts in blue.
  • ...and 16 more figures