MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation

Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao

TL;DR

MMRC introduces a large-scale, multi-image open-ended conversation benchmark to evaluate six core abilities of multimodal LLMs in real-world dialogue. It combines a real-world data platform (DialogFlow) with a triplet-based evaluation framework (S, q, a) and a comprehensive mix of GPT-based, human, and precision metrics, revealing persistent gaps and four common failure patterns in current models. The authors propose a Note-taking strategy that externalizes memory and facts as structured notes, achieving meaningful improvements across information extraction, information update, memory recall, and related abilities. This benchmark and the accompanying findings offer a practical path toward developing more reliable, memory-aware MLLMs for real-world, long-horizon conversations.

Abstract

Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their ability to memorize, recall, and reason over sustained interactions in real-world scenarios remains underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations of 20 MLLMs on MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, error propagation from accumulated assumptions, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which records key information from the conversation and reminds the model of it when responding, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
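The note-taking idea described in the abstract (record key information, then remind the model when it responds) can be illustrated with a minimal sketch. This is not the authors' implementation: the `NoteBook` class, the `build_prompt` helper, the note keys, and the prompt layout are all hypothetical, and the sketch assumes the caller decides what counts as key information, whereas the paper may have the model write its own notes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NoteBook:
    """Hypothetical external memory holding key facts outside the model context."""
    notes: Dict[str, str] = field(default_factory=dict)

    def record(self, key: str, value: str) -> None:
        # Overwrite on update so a stale fact never coexists with its correction.
        self.notes[key] = value

    def render(self) -> str:
        # Produce the reminder block that is placed ahead of the next user turn.
        if not self.notes:
            return "Notes: (none)"
        lines = [f"- {key}: {value}" for key, value in self.notes.items()]
        return "Notes:\n" + "\n".join(lines)


def build_prompt(notebook: NoteBook, user_turn: str) -> str:
    # Assumed prompt layout: notes first, then the current user message.
    return f"{notebook.render()}\n\nUser: {user_turn}"


notebook = NoteBook()
notebook.record("pet_name", "Milo")
notebook.record("image_1", "photo of a red bicycle")
notebook.record("pet_name", "Luna")  # information update replaces the old fact

prompt = build_prompt(notebook, "What is my pet called?")
print(prompt)
```

Keying notes by a stable name means an update overwrites the earlier value rather than appending to it, which is one simple way to target the information-update and memory-recall failures listed above. A real system would also need a policy for what to record and when to drop notes.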

Paper Structure

This paper contains 24 sections, 4 equations, 30 figures, 6 tables, 1 algorithm.

Figures (30)

  • Figure 1: Illustration of the six core multimodal open-ended conversation abilities in the MMRC benchmark.
  • Figure 2: A sample from the MMRC, featuring a multi-turn open-ended conversation with six human-annotated questions and answers, designed to assess the ability of MLLMs in open-ended conversations.
  • Figure 3: Data construction pipeline of MMRC.
  • Figure 4: The distribution of dialogue turns in MMRC, ConvBench, and EvalDial.
  • Figure 5: The distribution of conversation categories in our MMRC dataset.
  • ...and 25 more figures