MMRel: Benchmarking Relation Understanding in Multi-Modal Large Language Models

Jiahao Nie, Gongjie Zhang, Wenbin An, Yun Xing, Yap-Peng Tan, Alex C. Kot, Shijian Lu

TL;DR

This work presents MMRel, a large-scale, high-quality benchmark dedicated to inter-object relation understanding in multi-modal LLMs. It defines a clear taxonomy (spatial, action, comparative) and adds an adversarial subset to probe hallucinations, using a semi-automatic data collection pipeline that combines GPT-4V annotations with human verification and DALL-E–generated images. Through evaluations of 28 MLLMs on Yes/No and open-ended tasks, MMRel reveals persistent gaps in relation understanding across domains and models, and demonstrates that fine-tuning with MMRel substantially improves performance and reduces hallucinations. The findings underscore the value of diverse data, precise relation definitions, and reasoning-enabled architectures for advancing vision-language perception tasks.

Abstract

Though Multi-modal Large Language Models (MLLMs) have recently achieved significant progress, they often struggle to understand diverse and complicated inter-object relations. Specifically, the lack of large-scale and high-quality relation data has greatly hindered the progress of MLLMs in various vision-language perception tasks. We attempt to address this challenge by contributing the Multi-Modal Relation Understanding benchmark (MMRel), which features large-scale, high-quality, and diverse data on inter-object relations. MMRel has three distinctive attributes: (i) it contains 22,500 question-answer pairs spanning three distinct domains and around 400 relations, ensuring both scale and diversity; (ii) it provides manually verified, high-quality labels to ensure exceptional annotation accuracy; and (iii) it includes adversarial cases with highly unusual relations, offering a challenging setting for evaluating relation hallucination. These features make MMRel ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance relation comprehension capability. Extensive experiments on 28 MLLMs demonstrate the effectiveness of MMRel in both evaluating and enhancing MLLMs' relation understanding, and the accompanying analyses provide insights for future research. The benchmark has been made publicly available at: https://niejiahao1998.github.io/MMRel
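
As an illustration of the Yes/No evaluation mentioned above, the following is a minimal sketch of how per-category accuracy could be computed over the three relation categories (spatial, action, comparative). The record layout, the field names, the sample questions, and the helpers `normalize` and `accuracy_by_category` are all assumptions made for this example; they are not taken from the MMRel release, and the paper's actual scoring procedure may differ.

```python
from collections import defaultdict

# Hypothetical QA records: field names and questions are illustrative only,
# not the actual MMRel schema or data.
samples = [
    {"category": "spatial", "question": "Is the cup to the left of the laptop?", "answer": "yes"},
    {"category": "spatial", "question": "Is the dog under the table?", "answer": "no"},
    {"category": "action", "question": "Is the man riding the horse?", "answer": "yes"},
    {"category": "comparative", "question": "Is the car bigger than the bus?", "answer": "no"},
]


def normalize(reply):
    """Map a free-form model reply to 'yes', 'no', or None if neither leads the reply."""
    words = reply.strip().lower().replace(",", " ").replace(".", " ").split()
    if not words:
        return None
    first = words[0]
    return first if first in ("yes", "no") else None


def accuracy_by_category(samples, replies):
    """Return the fraction of correct Yes/No answers for each relation category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, reply in zip(samples, replies):
        total[sample["category"]] += 1
        if normalize(reply) == sample["answer"]:
            correct[sample["category"]] += 1
    return {category: correct[category] / total[category] for category in total}


# Placeholder replies standing in for model output.
replies = ["Yes, it is.", "No.", "Maybe", "yes"]
scores = accuracy_by_category(samples, replies)
```

In this sketch, a reply that begins with neither "yes" nor "no" is counted as incorrect, which is one possible convention among several.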

Paper Structure (15 sections, 10 figures, 10 tables)

This paper contains 15 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: The proposed Multi-Modal Relation Understanding benchmark (MMRel) features large-scale, high-quality, and diverse data. Left: MMRel covers three domains that span seven subsets. Right: MMRel comprises 22,500 QA pairs on 804 objects and 394 kinds of relations.
  • Figure 2: Multi-Modal Large Language Models tend to fail in understanding inter-object relations.
  • Figure 3: Sample images from the MMRel benchmark. MMRel consists of three categories of inter-object relations: spatial, action, and comparative relations. The images are sourced from three domains: (a) real images, (b) synthetic images generated by SDXL, and (c) images generated by DALL-E. More DALL-E samples are shown in the appendix.
  • Figure 4: Limitations of existing benchmarks: implausible negative choices that can be easily ruled out, as illustrated in (a) and (b); complex and subjective evaluation metrics, as in (c); and incomplete and ambiguous relation annotation of "contact" only, as in (d). In contrast, MMRel mitigates these limitations by: (i) providing a comprehensive taxonomy of inter-object relations, (ii) carefully designing plausible negative action relations, and (iii) adopting both discriminative Yes/No and generative open-ended evaluations.
  • Figure 5: We propose a Semi-automatic Data Collection (SemiDC) pipeline that constructs MMRel with two distinctive approaches: (a) re-labeling images from Visual Genome (Krishna et al., 2017) with GPT-4V (Achiam et al., 2023), and (b) generating synthetic images with DALL-E (Betker et al., 2023). Both approaches are followed by human verification to ensure data quality.
  • ...and 5 more figures