Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Kohei Uehara; Nabarun Goswami; Hanqin Wang; Toshiaki Baba; Kohtaro Tanaka; Tomohiro Hashimoto; Kai Wang; Rei Ito; Takagi Naoya; Ryo Umagami; Yingyi Wen; Tanachai Anakewat; Tatsuya Harada

TL;DR

This work aims to make Vision-and-Language Models (VLMs) more reliable and interpretable by giving them explicit Chain-of-Reasoning (CoR) and a proactive question-asking mechanism. It introduces a CoR dataset generated by an LLM, which encodes reasoning steps, the knowledge the model expects to need, questions, and answers, and shows that fine-tuning a VLM on this data enables explicit reasoning and information seeking during inference. The model architecture connects a CLIP-ViT-based image encoder to a text decoder through an adapter MLP, which supports stepwise reasoning and question generation, with answers to the generated questions supplied externally during inference. Experimental results show improved performance across diverse V&L tasks, especially those requiring specialized knowledge, highlighting the value of explicit reasoning and interactive knowledge acquisition for robust, interpretable VLMs.

Abstract

The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to developing a VLM that can conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Using this dataset, we fine-tuned an existing VLM, which enabled the model to generate questions and perform iterative reasoning during inference. The results represent a step toward a more robust, accurate, and interpretable VLM that reasons explicitly and seeks information proactively when confronted with ambiguous visual input.
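The iterative reasoning with question asking described in the abstract can be pictured as a simple loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Step`, `reason_with_questions`, `toy_model`, and `toy_oracle` are hypothetical, the step labels are illustrative, and the scripted toy model stands in for a fine-tuned VLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str  # hypothetical labels: "thought", "question", "external_answer", "answer"
    text: str


def reason_with_questions(
    generate_step: Callable[[List[Step]], Step],
    answer_question: Callable[[str], str],
    max_steps: int = 8,
) -> List[Step]:
    """Run reasoning steps until a final answer, fetching external answers to questions."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = generate_step(history)
        history.append(step)
        if step.kind == "question":
            # The answer comes from outside the model (a person or a knowledge source).
            history.append(Step("external_answer", answer_question(step.text)))
        elif step.kind == "answer":
            break
    return history


def toy_model(history: List[Step]) -> Step:
    """Scripted stand-in for a VLM: think, ask once, then answer."""
    kinds = [s.kind for s in history]
    if not history:
        return Step("thought", "The image shows a bird I cannot identify.")
    if "external_answer" not in kinds:
        return Step("question", "What species is the bird?")
    return Step("answer", "It is a " + history[-1].text + ".")


def toy_oracle(question: str) -> str:
    """Stand-in for the external answer source."""
    return "kingfisher"


trace = reason_with_questions(toy_model, toy_oracle)
for s in trace:
    print(f"[{s.kind}] {s.text}")
```

The `max_steps` cap is a design choice of this sketch so that a model which keeps asking questions cannot loop indefinitely.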

Paper Structure (15 sections, 13 figures, 2 tables)

Introduction
Related Work
Large-scale Vision-and-Language Models
Explicit Reasoning in V&L Tasks
Visual Question Generation
Method
Dataset Construction
Dataset Statistics
Model Architecture
Training
Experiments
Implementation Details
Evaluation Settings
Results and Discussions
Conclusion

Figures (13)

Figure 1: An example of the explicit reasoning steps we aim to achieve, also representing a sample from our constructed dataset. It demonstrates the thought process in response to a given question. Notably, we incorporate a question generation step into the reasoning process (as seen on the right side of the figure), allowing the model to interactively acquire knowledge and refine its reasoning steps.
Figure 2: An overview of the dataset construction process. The LLM takes bounding-box information, image captions, and instructions as input and generates reasoning steps and questions as output.
Figure 3: An example from our dataset, created from the OK-VQA dataset. For the given question, the dataset contains a series of reasoning steps in three settings: without QA, with QA, and with GT.
Figure 4: Examples from the dataset in each category: visual understanding tasks, vision + common-sense understanding tasks, and vision + encyclopedic knowledge tasks.
Figure 5: An overview of the model. Image encoders extract embeddings from input images, which are fed into the Adapter MLPs. The extracted image feature and instruction texts are fed into the LLM, culminating in the generation of a text response. This architecture enables the model to consider both visual information and textual instructions in its reasoning process.
...and 8 more figures
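
The data flow in the Figure 5 caption (image embeddings passed through an adapter MLP, then combined with instruction text as LLM input) can be illustrated with a toy example. This is a rough sketch under assumptions, not the paper's architecture: the two-layer ReLU adapter, the dimensions, the random stand-in embeddings, and the choice to place visual tokens before text tokens are all hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Affine map: each row of w produces one output feature."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def adapter_mlp(patch: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Assumed two-layer MLP projecting one image embedding into the LLM embedding size."""
    hidden = [max(0.0, h) for h in linear(patch, w1, b1)]  # ReLU is an assumption
    return linear(hidden, w2, b2)


def init(d_out: int, d_in: int, rng: random.Random) -> Matrix:
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


rng = random.Random(0)
D_IMG, D_HID, D_LLM = 4, 8, 6  # toy sizes, not taken from the paper

w1, b1 = init(D_HID, D_IMG, rng), [0.0] * D_HID
w2, b2 = init(D_LLM, D_HID, rng), [0.0] * D_LLM

# Stand-ins for image-encoder outputs (3 patches) and instruction embeddings (5 tokens).
image_patches = [[rng.random() for _ in range(D_IMG)] for _ in range(3)]
text_tokens = [[rng.random() for _ in range(D_LLM)] for _ in range(5)]

visual_tokens = [adapter_mlp(p, w1, b1, w2, b2) for p in image_patches]
llm_input = visual_tokens + text_tokens  # ordering is an assumption of this sketch

print(len(llm_input), "tokens of width", len(llm_input[0]))
```

The point of the adapter in this sketch is only that visual features end up with the same width as the text embeddings, so the decoder can consume both in a single sequence.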
