II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

Jihyung Kil; Farideh Tavazoee; Dongyeop Kang; Joo-Kyung Kim

TL;DR

This work tackles the challenge of evaluating and enhancing multi-hop reasoning in visual question answering (VQA). It introduces II-MMR, which uses two novel language prompts—Answer Prediction-Guided CoT (ApCoT) and Knowledge Triplet-Guided Prompt (KtPrompt)—to derive an explicit reasoning path for each question-image pair. By analyzing these paths, II-MMR identifies the distribution of reasoning hops and distinguishes visual from beyond-visual reasoning, revealing biases in benchmarks like GQA and A-OKVQA. The approach improves VQA performance across reasoning cases in both zero-shot and fine-tuning settings and demonstrates applicability to different vision-language models. These insights point to more informative benchmarks and improved reasoning capabilities for future V&L systems.

Abstract

Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have focused only on assessing the model's overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using one of two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most of their VQA questions are easy to answer, simply demanding "single-hop" reasoning, whereas only a few questions require "multi-hop" reasoning. Moreover, while the recent V&L model struggles with such complex multi-hop reasoning questions even when using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.

Paper Structure (22 sections, 8 figures, 9 tables)

Introduction
Proposed Approach: II-MMR
Finding a reasoning path to the answer
Preliminary Analysis
Answer prediction-guided CoT (ApCoT)
Knowledge Triplet-guided Prompt (KtPrompt)
Analyzing a reasoning path
Model performance on the reasoning cases
Experimental Setup
Experimental Results
Analysis of reasoning in VQA benchmarks
Accuracy of predicting hops and reasoning path
Benefit of II-MMR in zero-shot stage
Benefit of II-MMR in fine-tuning stage
Expanding questions with more reasoning
...and 7 more sections

Figures (8)

Figure 1: Overview of II-MMR. Our II-MMR automatically identifies different reasoning cases in VQA benchmarks by measuring how many and what types (visual or beyond-visual) of reasoning are required to solve a VQA question. The identified reasoning process in II-MMR also helps make a correct prediction (Cold), while the simple Chain-of-Thought (CoT) method (Kojima et al., 2022) fails to answer.
Figure 2: Pipeline of II-MMR. Given a VQA question with its image, II-MMR first generates a reasoning path to the answer using either the V&L model (VLM) or the LLM. We then utilize this path to identify different reasoning cases in VQA benchmarks by estimating the number and types (visual or beyond-visual) of reasoning required for the question. Finally, II-MMR feeds the reasoning path, along with the question and the image, into the VLM to predict the answer.
Figure 3: The language prompts of our II-MMR. Top: II-MMR (ApCoT) first asks the VLM to predict an answer for a VQA question. It then integrates its prediction into the CoT prompt to generate an answer-related rationale, a sequence of reasoning sentences. Bottom: II-MMR (KtPrompt) initially instructs the LLM to convert the question and answer (QA) into a caption. Then, II-MMR (KtPrompt) inputs a prompt (with task, in-context example, and target caption) to the LLM to extract knowledge triplets from the QA. We treat the sequence of sentences (or knowledge triplets) as the reasoning path to reach the answer.
Figure 4: Analyzing the reasoning types. The LLM extracts keywords (e.g., "bottle", "temperature") from each reasoning sentence. Meanwhile, the object detector identifies objects (e.g., "fridge", "bottle") in the image. We then check if all keywords match visual objects and decide the reasoning type (visual or beyond-visual) of each sentence in the rationale.
Figure 5: In-context language prompting to make the original question more complex. "Bridge Entity" is a keyword extracted from the original question. "Captions" is the text snippet containing information about the bridge entity retrieved from Wikipedia. Using Wikipedia captions, we ask the LLM to increase the reasoning complexity in the original question while maintaining its original answer. We provide five in-context examples to the LLM.
...and 3 more figures
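
The check described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Hop`, `classify_hop`, and `analyze_path` are hypothetical, the keywords and detected objects are passed in as plain lists rather than produced by an LLM or an object detector, and matching is assumed to be exact, case-insensitive string equality. Counting one hop per sentence in the reasoning path is likewise an assumption based on the Figure 3 caption, and the two example sentences are invented for illustration (only the keywords and objects come from the caption).

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class Hop:
    """One step of a reasoning path (hypothetical container, not from the paper)."""
    sentence: str
    keywords: List[str]
    reasoning_type: str  # "visual" or "beyond-visual"


def classify_hop(keywords: Sequence[str], detected_objects: Iterable[str]) -> str:
    """Label a sentence "visual" only if every keyword matches a detected object.

    Assumption: matching is exact string equality after lowercasing and
    stripping whitespace; the paper may use a different matching rule.
    """
    objects = {name.strip().lower() for name in detected_objects}
    normalized = [k.strip().lower() for k in keywords]
    # Assumption: a sentence with no keywords is not counted as visual.
    if normalized and all(k in objects for k in normalized):
        return "visual"
    return "beyond-visual"


def analyze_path(
    sentences: Sequence[str],
    keywords_per_sentence: Sequence[Sequence[str]],
    detected_objects: Iterable[str],
) -> List[Hop]:
    """Classify every sentence of a reasoning path against the detected objects."""
    if len(sentences) != len(keywords_per_sentence):
        raise ValueError("need exactly one keyword list per sentence")
    objects = list(detected_objects)  # materialize once so it can be reused
    return [
        Hop(sentence, list(keywords), classify_hop(keywords, objects))
        for sentence, keywords in zip(sentences, keywords_per_sentence)
    ]


# Illustrative inputs: the sentences are invented; the keywords and objects
# follow the examples named in the Figure 4 caption.
path = [
    "The bottle is inside the fridge.",
    "A fridge keeps the bottle at a low temperature.",
]
keywords = [["bottle", "fridge"], ["bottle", "temperature"]]
detected = ["fridge", "bottle"]

hops = analyze_path(path, keywords, detected)
num_hops = len(hops)  # assumption: one hop per sentence in the path

for hop in hops:
    print(hop.reasoning_type, "|", hop.sentence)
```

In this toy case the first sentence is labeled visual because both of its keywords appear among the detected objects, while the second is labeled beyond-visual because "temperature" is not something the detector can see.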
