InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou
TL;DR
InterFeedback addresses how to evaluate the interactive intelligence of large multimodal models (LMMs) by introducing an interactive framework and a benchmark suite. InterFeedback-Bench formulates interactive problem-solving as a partially observable Markov decision process (POMDP) and uses two challenging datasets, MMMU-Pro and MathVerse, to automatically test whether a model can improve its answers from feedback. It is complemented by InterFeedback-Human, a 120-case dataset for manual assessment. Across 10 open-source LMMs and leading proprietary models such as OpenAI-o1, the findings show that interactive feedback improves performance on some tasks, but most models struggle to make meaningful use of it, and accuracy alone does not capture a model's capacity to benefit from feedback. The work also shows that feedback quality matters: simple binary feedback can sometimes outperform more detailed but noisier explanations. This points to the need for better mechanisms for interpreting and incorporating feedback in LMMs, with practical implications for human-AI collaboration.
Abstract
Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework, which can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-Sonnet-4. Our evaluation results indicate that even the state-of-the-art LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%. Our findings point to the need for methods that can enhance LMMs' capabilities to interpret and benefit from feedback.
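The interactive setting described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_episode`, `binary_feedback`, and `toy_model`, the exact-match correctness check, and the three-round limit are all assumptions made for illustration. The sketch shows only the general shape of the loop, in which a model answers, receives simple binary feedback, and tries again.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# One past turn: (the model's answer, the feedback it received).
History = List[Tuple[str, str]]


@dataclass
class Episode:
    answers: List[str]  # every answer the model gave, in order
    solved: bool        # True if any answer matched the ground truth

    @property
    def rounds_used(self) -> int:
        return len(self.answers)


def binary_feedback(answer: str, ground_truth: str) -> Optional[str]:
    """Return None if the answer is correct, else a plain 'incorrect' message.

    Exact string match is an assumption; a real benchmark would need a
    task-specific answer checker.
    """
    if answer.strip().lower() == ground_truth.strip().lower():
        return None
    return "Your answer is incorrect. Please try again."


def run_episode(
    model: Callable[[str, History], str],
    question: str,
    ground_truth: str,
    max_rounds: int = 3,  # hypothetical budget, not taken from the paper
) -> Episode:
    """Query the model repeatedly, passing back feedback after each miss."""
    history: History = []
    answers: List[str] = []
    for _ in range(max_rounds):
        answer = model(question, history)
        answers.append(answer)
        feedback = binary_feedback(answer, ground_truth)
        if feedback is None:
            return Episode(answers, solved=True)
        history.append((answer, feedback))
    return Episode(answers, solved=False)


def toy_model(question: str, history: History) -> str:
    """Hypothetical stand-in for an LMM: picks the first option not yet rejected."""
    options = ["A", "B", "C", "D"]
    rejected = {answer for answer, _ in history}
    for option in options:
        if option not in rejected:
            return option
    return options[-1]


if __name__ == "__main__":
    episode = run_episode(toy_model, "Which option is correct?", "C")
    print(episode.answers, episode.solved)
```

Under these assumptions, comparing whether the first answer is correct with whether the episode is eventually solved separates a model's initial accuracy from its ability to benefit from feedback, which is the distinction the evaluation draws.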
