InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

TL;DR

InterFeedback addresses the evaluation of interactive intelligence in large multimodal models (LMMs) by introducing a formal framework and a benchmark suite. InterFeedback-Bench uses a POMDP-based formulation and two challenging datasets, MMMU-Pro and MathVerse, to automate interactive problem-solving and self-improvement from feedback; InterFeedback-Human complements it with manual assessments. Across 10 open-source LMMs and several proprietary models, the findings show that interactive feedback can improve performance on some tasks, but most models struggle to leverage feedback meaningfully, and accuracy alone does not capture a model's capacity to benefit from feedback. The work also emphasizes that feedback quality matters: simple binary feedback can sometimes outperform more detailed but noisier explanations. This highlights the need for better mechanisms to interpret and incorporate feedback in LMMs, with practical implications for human-AI collaboration.

Abstract

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to the need for methods that can enhance LMMs' capabilities to interpret and benefit from feedback.

Paper Structure

This paper contains 19 sections, 2 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Illustration of an interactive feedback scenario. When models generate incorrect responses, human users provide pertinent feedback to interactively refine the answers.
  • Figure 2: Overview of the test data construction process for InterFeedback-Bench. For each LMM serving as the feedback receiver, we process each instance from a target dataset (e.g., MathVerse) and collect the error cases to form a negative set. The feedback provider then processes the same instances to build a positive set. Finally, we curate test data by selecting the intersection of both sets.
  • Figure 3: Overview of the proposed framework InterFeedback for assessing an LMM's ability to improve itself through feedback. The model interacts with humans to progressively solve a problem, and after each conversation round, we verify the correctness of the answer. If the answer is incorrect, an LMM-simulated human will provide constructive feedback. We implement two types of feedback to investigate the behavior of LMMs.
  • Figure 4: Distribution of samples being corrected in each round. We can observe that Claude-3.5-Sonnet achieves the best performance in round 0.
  • Figure 5: Distribution of corrected samples across various task categories. Visual logic tasks are mostly resolved within the first two rounds, whereas Math (Text-only) and MMMU-Pro tasks show few corrections in rounds 1 and 2. In contrast, Coding (Text-only) and MathVerse tasks exhibit corrections during rounds 1 and 2.
  • ...and 5 more figures
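
The captions for Figures 2 and 3 describe two mechanical steps: selecting test instances as the intersection of the receiver's error set and the provider's correct set, and running a round-based loop in which feedback is given after each wrong answer. The Python sketch below illustrates both under stated assumptions. The function names (`build_test_set`, `interact`), the callback signatures, and the `max_rounds` default are hypothetical and are not taken from the paper's code; answer checking is reduced to exact string equality for brevity.

```python
from typing import Callable, Hashable, Iterable, Optional, Set


def build_test_set(
    instance_ids: Iterable[Hashable],
    receiver_correct: Callable[[Hashable], bool],
    provider_correct: Callable[[Hashable], bool],
) -> Set[Hashable]:
    """Select instances the receiver gets wrong and the provider gets right."""
    ids = list(instance_ids)
    # Negative set: error cases of the feedback receiver.
    negative = {i for i in ids if not receiver_correct(i)}
    # Positive set: instances the feedback provider solves correctly.
    positive = {i for i in ids if provider_correct(i)}
    return negative & positive


def interact(
    answer_fn: Callable[[Optional[str]], str],
    feedback_fn: Callable[[str], str],
    ground_truth: str,
    max_rounds: int = 3,  # assumed budget, not a value from the paper
) -> Optional[int]:
    """Return the round in which the answer became correct, else None.

    Round 0 is the initial attempt, made without any feedback.
    """
    feedback: Optional[str] = None
    for round_idx in range(max_rounds + 1):
        answer = answer_fn(feedback)
        if answer == ground_truth:
            return round_idx
        # Only produce feedback if another round will follow.
        if round_idx < max_rounds:
            feedback = feedback_fn(answer)
    return None
```

In this sketch, simple binary feedback corresponds to a `feedback_fn` that returns a fixed message such as "Your answer is incorrect." regardless of the answer, while a more detailed variant would return an explanation generated by the provider model.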