FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

Pengxiang Li; Zhi Gao; Bofei Zhang; Tao Yuan; Yuwei Wu; Mehrtash Harandi; Yunde Jia; Song-Chun Zhu; Qing Li

TL;DR

FIRE introduces a large-scale feedback-refinement dataset for vision-language models, enabling spontaneous response refinement based on user feedback. It provides FIRE-100K, generated with GPT-4V, and FIRE-1M, produced by self-play between tuned student and teacher models, together forming 1.1M conversations; FIRE-Bench offers 11K test dialogues across seen and unseen tasks. Training LLaVA-based models on FIRE (FIRE-LLaVA) yields substantial improvements in feedback integration while preserving instruction-following capabilities, demonstrating the value of feedback-driven refinement for efficient, multi-task visual reasoning. The work delivers a scalable data-generation approach, a comprehensive benchmark, and a strong baseline for feedback refinement in real-user multimodal interactions.

Abstract

Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset consisting of 1.1M multi-turn conversations derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K, which is generated by GPT-4V, and FIRE-1M, which is freely generated by models trained on FIRE-100K. We then build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as test data, two evaluation settings, and a model that provides feedback to VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M; it shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, enabling more efficient user-agent interactions and underscoring the significance of the FIRE dataset.
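As a rough illustration of what a multi-turn feedback-refinement conversation might look like as data, the following Python sketch simulates a student that answers and a teacher that responds with feedback until the answer is accepted. Everything here is an assumption made for illustration: the names `Turn`, `Conversation`, and `self_play`, the numeric `score` field, the stopping rule, and the toy student and teacher functions are hypothetical and are not taken from the FIRE release.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Turn:
    response: str  # student's answer at this turn
    feedback: str  # teacher's textual feedback on that answer
    score: int     # hypothetical quality score attached to the feedback


@dataclass
class Conversation:
    question: str
    turns: List[Turn] = field(default_factory=list)


def self_play(
    question: str,
    student: Callable[[str, List[Turn]], str],
    teacher: Callable[[str, str], Tuple[str, int]],
    max_turns: int = 3,
    target_score: int = 10,
) -> Conversation:
    """Alternate student responses and teacher feedback (illustrative only)."""
    conv = Conversation(question)
    for _ in range(max_turns):
        # The student sees the question plus all earlier feedback.
        response = student(question, conv.turns)
        feedback, score = teacher(question, response)
        conv.turns.append(Turn(response, feedback, score))
        if score >= target_score:  # assumed stopping rule
            break
    return conv


# Toy stand-ins for the tuned models; real models would also take an image.
def toy_student(question: str, history: List[Turn]) -> str:
    return "4" if history else "3"


def toy_teacher(question: str, response: str) -> Tuple[str, int]:
    if response == "4":
        return "Correct.", 10
    return "Recount the objects on the left side.", 4


conv = self_play("How many cats are in the image?", toy_student, toy_teacher)
for i, turn in enumerate(conv.turns, start=1):
    print(i, turn.response, turn.score, turn.feedback)
```

The sketch keeps the student and teacher as plain callables so that either side could be swapped for a real model without changing the loop.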

Paper Structure (36 sections, 6 equations, 22 figures, 12 tables)

Introduction
Related Work
Vision Language Models
Vision-Language Data Generation
Feedback Learning in Multimodal Models
Task Definition
FIRE-100K
FIRE-1M
FIRE-Bench
Evaluation Settings
Dataset Analysis
Model
Student Model
Teacher Model
Experiments
...and 21 more sections

Figures (22)

Figure 1: The comparison of the feedback-refining capability among different models. While the original LLaVA hardly improves its responses, our model trained on FIRE can effectively integrate the user feedback and produce much better responses, which are closer to those of GPT-4V.
Figure 2: Data sources in FIRE. Shaded entries are new data sources in FIRE-Bench.
Figure 3: The pipeline to create FIRE-100K and FIRE-1M data.
Figure 4: We use two settings to evaluate student and teacher models.
Figure 5: Data statistics on FIRE-100K, FIRE-1M, and FIRE-Bench.
...and 17 more figures
