Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild

Wanpeng Hu, Haodi Liu, Lin Chen, Feng Zhou, Changming Xiao, Qi Yang, Changshui Zhang

TL;DR

This work introduces Socratic Questioning (SQ), a multi-round framework that combines Chain-of-Thought reasoning with visual instruction tuning to enable fine-grained, hallucination-resistant visual reasoning on lightweight multimodal models. A new dataset, CapQA, is built to train and evaluate SQ; each sample is a structured multi-turn conversation of questions, answers, a detailed description, and a concise caption. SQ achieves notable reductions in hallucinations and improvements in questioning quality. The method shows strong zero-shot performance across diverse benchmarks and reduces training cost by using a single shared LLM with adapter-based visual grounding, while GPT-4V-based automatic annotation guides data creation. The approach offers a practical, scalable path toward robust multimodal reasoning in real-world settings, and the code and data will be released to foster further research.

Abstract

Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (CoT) and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Issues such as hallucinations and high training cost also still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model's ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We have named this framework Socratic Questioning (SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation. Our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ's remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning, and hallucination mitigation. Our model and code will be publicly available.

Paper Structure (43 sections, 2 equations, 6 figures, 12 tables)

This paper contains 43 sections, 2 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Comparison of Questions Generation on LLaVA-with-SQ and LLaVA.
  • Figure 2: SQ network architecture. Note that the two LLM modules correspond to a single LLM. The visual encoder outputs visual features, which the adapter maps to visual tokens. The visual tokens, along with the self-ask and self-answer prompt tokens, make the LLM generate a rationale consisting of Q&A pairs. The same LLM then takes the rationale tokens and the description and summarization prompt tokens to produce the final caption.
  • Figure 2: Ablation w/o multi-turn train/inference on CapQA. We adopt Vicuna-7B as the LLM. HalS: Hallucination Score; QQS: Questions Quality Score. Evaluation is GPT-4-aided.
  • Figure 3: Illustrations of the training (left) and 3-turn inference (right) processes of SQ.
  • Figure 4: Comparison of Questions Generation on LLaVA-with-SQ and LLaVA.
  • ...and 1 more figure
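
The data flow described in the SQ network architecture caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (sq_inference, encode, adapt, generate), the prompt wording, and the toy stand-ins are all hypothetical, and it assumes the second stage receives only the rationale and prompt text, since the caption does not say whether visual tokens are passed again. How these stages map onto the 3-turn inference process is not specified in this excerpt.

```python
from typing import Callable, Dict, List, Optional

Tokens = List[float]  # stand-in for visual token embeddings
# Hypothetical LLM interface: (text prompt, optional visual tokens) -> text.
GenerateFn = Callable[[str, Optional[Tokens]], str]

# Placeholder prompts; the actual prompt wording is not given in the source.
SELF_QA_PROMPT = "Ask and answer questions about the relevant visual details."
CAPTION_PROMPT = "Describe the image in detail, then summarize it as a caption."


def sq_inference(
    image: object,
    encode: Callable[[object], Tokens],
    adapt: Callable[[Tokens], Tokens],
    generate: GenerateFn,
) -> Dict[str, str]:
    """Run a two-stage self-questioning pass with one shared LLM."""
    # Visual encoder features are mapped by the adapter to visual tokens.
    visual_tokens = adapt(encode(image))

    # Stage 1: visual tokens + self-ask/self-answer prompt -> Q&A rationale.
    rationale = generate(SELF_QA_PROMPT, visual_tokens)

    # Stage 2: the same LLM reads the rationale plus the description and
    # summarization prompt. Assumption: no visual tokens are passed here.
    caption = generate(f"{rationale}\n{CAPTION_PROMPT}", None)
    return {"rationale": rationale, "caption": caption}


def toy_generate(prompt: str, visual_tokens: Optional[Tokens]) -> str:
    """Toy stand-in for the LLM so the sketch runs without a model."""
    if visual_tokens is not None:
        return "Q: What is the person holding? A: A tennis racket."
    return "caption based on: " + prompt.splitlines()[0]


result = sq_inference(
    "img.png",                     # hypothetical image handle
    lambda img: [0.1, 0.2],        # toy visual encoder
    lambda f: [x * 2 for x in f],  # toy adapter
    toy_generate,
)
print(result["caption"])
```

Passing the same generate callable to both stages mirrors the caption's note that the two LLM modules correspond to a single LLM; a real system would replace the toy callables with an actual encoder, adapter, and model.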