Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs

Liu Jing, Amirul Rahman

TL;DR

This work addresses the difficulty of multi-step visual reasoning in large vision-language models by introducing MF-SQ-LLaVA, which enables implicit self-questioning through end-to-end training. The model learns to generate and answer internal sub-questions from reasoning chains added to the training data, guided by a multi-task loss over sub-question generation, sub-question answering, and final answer prediction. On ScienceQA and VQAv2, it outperforms strong baselines, including the base LLaVA and the original SQ-LLaVA; ablations confirm the contribution of each component, and human evaluation validates improved reasoning coherence. The approach reduces reliance on external language models at inference and reports improvements across reasoning depth, subject areas, and answer types.

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.
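
The abstract names three training objectives but does not give the form of the loss. As a minimal sketch, assuming the three terms are combined as a weighted sum of per-segment negative log-likelihoods, the idea could look like the following. The class names, function names, and weights (`w_sq`, `w_sa`, `w_ans`) are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReasoningStep:
    """One intermediate step of a reasoning chain (hypothetical layout)."""
    sub_question: str
    sub_answer: str


@dataclass
class AugmentedSample:
    """A VQA example augmented with a chain of sub-question/answer pairs."""
    question: str
    steps: List[ReasoningStep]
    final_answer: str


def mean_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the probabilities assigned to target tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def multi_task_loss(
    sq_probs: Sequence[float],   # probabilities of the sub-question tokens
    sa_probs: Sequence[float],   # probabilities of the sub-answer tokens
    ans_probs: Sequence[float],  # probabilities of the final-answer tokens
    w_sq: float = 1.0,           # assumed weight, not from the paper
    w_sa: float = 1.0,           # assumed weight, not from the paper
    w_ans: float = 1.0,          # assumed weight, not from the paper
) -> float:
    """Weighted sum of the three objectives (an assumed combination rule)."""
    return (
        w_sq * mean_nll(sq_probs)
        + w_sa * mean_nll(sa_probs)
        + w_ans * mean_nll(ans_probs)
    )


sample = AugmentedSample(
    question="Which object is attracted by the magnet?",
    steps=[ReasoningStep("What material is the nail made of?", "Iron")],
    final_answer="The nail",
)
loss = multi_task_loss([0.5, 0.5], [0.5], [0.5], w_ans=2.0)
```

Raising `w_ans` relative to the other weights would put more emphasis on the final answer; in a real training setup each term would typically be a token-level cross-entropy computed by the model over the corresponding span of the target sequence.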

Paper Structure

This paper contains 23 sections, 7 equations, 6 tables.