Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor
Jiali Chen, Xusen Hei, Yuqi Xue, Yuancheng Wei, Jiayuan Xie, Yi Cai, Qing Li
TL;DR
Addressing the gap in evaluating error-correction capabilities of large multimodal models on visual commonsense reasoning, the paper introduces the VCR-DF dataset and a PEIFG model for explainable feedback generation. PEIFG combines a Visual Feature Extractor, an Expert Prompt Selector, and a Text Generator, with a refinement loop to ensure explainability and fidelity to inputs. Experimental results show PEIFG outperforms strong baselines, including GPT-4V, across both distractor generation and feedback quality, with ablations clarifying the contributions of visual markers, expert prompts, and refinement. The work provides a benchmark and methodology for assessing LMM error-correction and educationally grounded feedback in multimodal reasoning, highlighting the potential of learned expert prompts and multimodal instructions for interpretable, corrective behavior.
Abstract
Large multimodal models (LMMs) have shown remarkable performance on the visual commonsense reasoning (VCR) task, which requires answering a multiple-choice question based on visual commonsense within an image. However, the ability of LMMs to correct visual commonsense errors in distractors when they occur remains under-explored. Drawing inspiration from how a human teacher crafts challenging distractors to test students' comprehension of concepts or skills and then helps them identify and correct errors on the way to the answer, we are the first to study how LMMs can simulate this error correction process. To this end, we employ GPT-4 as a "teacher" to collect VCR-DF, an explainable feedback dataset for error correction. It serves as a benchmark for evaluating the ability of LMMs to identify misconceptions and explain the reasons behind errors in VCR distractors, guiding toward the final answer. In addition, we propose an LMM-based Pedagogical Expert Instructed Feedback Generation (PEIFG) model that incorporates learnable expert prompts and multimodal instructions as guidance for feedback generation. Experimental results show that our PEIFG significantly outperforms existing LMMs. We believe that our benchmark provides a new direction for evaluating the capabilities of LMMs.
