Visual Question Decomposition on Multimodal Large Language Models

Haowei Zhang; Jianzhe Liu; Zhen Han; Shuo Chen; Bailan He; Volker Tresp; Zhiqiang Xu; Jindong Gu

TL;DR

This work investigates Visual Question Decomposition (VQD) for Multimodal LLMs, revealing that current MLLMs generate sub-questions of low quality. It introduces SubQuestRater to quantify sub-question quality along Non-Repetition, Relevance, and Groundedness, and builds DecoVQA to train and evaluate VQD ability. The authors then extend to DecoVQA+ with an extra selective-decomposition step and a novel loss, SelectiveVQD Loss, to teach models when to decompose. Finetuning across several MLLMs yields substantial improvements in sub-question quality, decomposition policy, and downstream VQA accuracy under selective VQD, demonstrating a practical path to more reliable multimodal reasoning.
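As a rough illustration of how per-criterion ratings might be summarized, the sketch below averages scores along the three criteria named above. The class `CriterionScores`, the function `average_scores`, and the numeric scale are hypothetical choices made for this example; the summary does not specify SubQuestRater's actual scoring format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class CriterionScores:
    """Ratings for one decomposition (hypothetical container, assumed numeric)."""
    non_repetition: float
    relevance: float
    groundedness: float


def average_scores(samples: Sequence[CriterionScores]) -> CriterionScores:
    """Average each criterion independently over all rated decompositions."""
    if not samples:
        raise ValueError("need at least one rated decomposition")
    return CriterionScores(
        non_repetition=mean(s.non_repetition for s in samples),
        relevance=mean(s.relevance for s in samples),
        groundedness=mean(s.groundedness for s in samples),
    )
```

Keeping the three criteria separate, rather than collapsing them into one number, makes it possible to compare a model before and after finetuning on each criterion individually.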

Abstract

Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, existing methods focus primarily on unimodal language models, and the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework, comprising a dataset and several evaluation criteria, to assess the quality of the decomposed sub-questions, and find that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a dedicated finetuning dataset, DecoVQA+, for enhancing the model's question decomposition capability. To enable models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline that consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and in the policy of selective question decomposition. The models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.
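The control flow implied by selective decomposition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name and the three callables (`should_decompose`, `decompose`, `answer`) are hypothetical stand-ins for calls to an MLLM, and the image input is omitted for brevity.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]


def answer_selectively(
    question: str,
    should_decompose: Callable[[str], bool],      # hypothetical: model's decompose-or-not decision
    decompose: Callable[[str], List[str]],        # hypothetical: model generates sub-questions
    answer: Callable[[str, List[QAPair]], str],   # hypothetical: model answers given prior QA context
) -> str:
    """Answer directly when decomposition is judged unnecessary, otherwise via sub-questions."""
    if not should_decompose(question):
        # Simple question: answer without any sub-question context.
        return answer(question, [])
    # Complex question: answer each sub-question first, then use the
    # resulting (sub-question, answer) pairs as context for the original.
    context = [(sq, answer(sq, [])) for sq in decompose(question)]
    return answer(question, context)
```

The point of the sketch is the branch itself: the decision to decompose is a separate step from generating sub-questions, which is the behavior the selective-decomposition training objective is described as targeting.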

Paper Structure (47 sections, 4 equations, 15 figures, 14 tables, 2 algorithms)

Introduction
Related Work
Question Decomposition
Multimodal LLMs
How well can MLLMs decompose questions?
Non-Repetition
Relevance
Groundedness
Enhancing MLLM's Visual Question Decomposition Capability
Dataset Construction of DecoVQA
Question Selection & Decomposition Annotation
Dataset Statistics
DecoVQA+
Training Objective
Experiments
...and 32 more sections

Figures (15)

Figure 1: Cases showing that even if the model correctly answers the original question, the generated sub-questions are of low quality: they are irrelevant or repeated from the original question.
Figure 2: Question decomposition examples of high quality and low quality given a certain image and question.
Figure 3: Comparison of VQD ability of different models across three evaluation criteria. Each bar chart represents a specific criterion, comparing the average scores of the original model (in blue) and the corresponding model finetuned with DecoVQA+ (in orange). The vertical axis shows the average scores, while the horizontal axis lists the models. The difference in bar height indicates the performance gain achieved through finetuning.
Figure 4: Cases showing the comparison of question decomposition by different models before and after finetuning. The left image demonstrates MiniGPT-v2's decomposition on A-OKVQA, while the right image shows LLaVA-1.5's decomposition on VQA-Introspect.
Figure 5: Prompt for scoring the quality of sub-questions with GPT-4V.
...and 10 more figures
