Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering

Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Li Shen, Dacheng Tao

TL;DR

This work introduces HinD, a framework for Knowledge-based Visual Question Answering that elicits explicit internal reasoning from multimodal LLMs by generating Hindsight-Zero data and distilling it into separate CoT and Knowledge generators. It couples this with Knowledge Encouragement Preference Optimization (KEPO) to align confidence with correctness, enabling better use of knowledge beyond the image. Through Hindsight Distillation Fine-Tuning and self-consistency, HinD demonstrates strong performance on OK-VQA and A-OKVQA with a 7B backbone and no external APIs, surpassing several larger models and retrieval-based baselines. The approach offers a scalable pathway to transparent reasoning in KBVQA, with potential for extension to other knowledge-intensive, multimodal tasks.

Abstract

Knowledge-based Visual Question Answering (KBVQA) requires incorporating external knowledge beyond cross-modal understanding. Existing KBVQA methods either use implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval-augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we propose a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness the internal knowledge reasoning ability of MLLMs. First, to tackle the reasoning supervision problem, we exploit the hindsight wisdom of the MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground-truth answer, constructing the Hindsight-Zero training data. We then self-distill Hindsight-Zero into a Chain-of-Thought (CoT) Generator and a Knowledge Generator, enabling the generation of sequential steps and discrete facts. Second, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over over-confident but unhelpful knowledge. The generated CoT and sampled knowledge are then used for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with reasoning elicited from a 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.
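
The preference idea behind KEPO (under-confident but helpful knowledge preferred over over-confident but unhelpful knowledge) can be illustrated with a minimal sketch of how a chosen/rejected pair might be picked from sampled knowledge. This is not the authors' implementation: the `Candidate` type, the scalar `confidence` field, the boolean `helpful` label, and the selection rule in `select_preference_pair` are all assumptions made for illustration, since the abstract does not specify how confidence or helpfulness are computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """One sampled knowledge statement (hypothetical structure)."""
    text: str
    confidence: float  # assumed: scalar generator confidence in [0, 1]
    helpful: bool      # assumed: True if the statement supports the ground-truth answer


def select_preference_pair(
    cands: List[Candidate],
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) or None if no misaligned pair exists.

    chosen   = the least confident helpful candidate
    rejected = the most confident unhelpful candidate
    A pair is only returned when the helpful candidate is less
    confident than the unhelpful one, i.e. confidence and
    correctness disagree.
    """
    helpful = [c for c in cands if c.helpful]
    unhelpful = [c for c in cands if not c.helpful]
    if not helpful or not unhelpful:
        return None
    chosen = min(helpful, key=lambda c: c.confidence)
    rejected = max(unhelpful, key=lambda c: c.confidence)
    if chosen.confidence >= rejected.confidence:
        return None  # confidence already agrees with helpfulness
    return chosen, rejected


samples = [
    Candidate("Pine trees keep their needles all year.", 0.35, True),
    Candidate("Maple trees are evergreen.", 0.80, False),
    Candidate("Pines are conifers.", 0.60, True),
]
pair = select_preference_pair(samples)
```

Pairs selected this way could then be passed to any standard preference-optimization objective; which objective KEPO actually uses is not stated in the abstract.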

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: KBVQA requires necessary knowledge incorporation for reasoning. Unlike existing in-context learning and retrieval-augmented methods, we elicit the reasoning paths inside MLLMs through Hindsight self-distillation.
  • Figure 2: Our Hindsight Distilled Reasoning (HinD) framework for KBVQA: (1) Constructing training data Hindsight-Zero by completing reasoning processes from question to ground truth answer. (2) Hindsight Distillation Fine-Tuning for CoT Generator and Knowledge Generator. (3) Knowledge Encouragement Preference Optimization for calibrating knowledge generation confidence and correctness. (4) Answer Generation to infer the final answer based on the generated CoT and knowledge.
  • Figure 3: The joint distribution of the Knowledge Generator's confidence $C$ and Hits (the number of the 10 annotated ground-truth answers that are mentioned) on the OK-VQA test set: (1) Zero-shot model (Qwen2.5-VL-7B), (2) HinD-Know with HDFT, and (3) HinD-Know with HDFT and KEPO.
  • Figure 4: Case study on the OK-VQA test set. The answer heuristic candidates are from the smaller VQA model MCAN in Prophet (Yu et al., 2025). The oracle documents are those in Google Search that include the ground-truth answers. Zero-shot generated CoT and Knowledge are from untrained Qwen2.5-VL-7B without HDFT or KEPO.
  • Figure 5: Confusion matrix on the OK-VQA test set. The vertical ✓ & ✗ denote whether the Answer Generator's prediction is correct, while the horizontal ✓ & ✗ denote whether the HinD-generated Knowledge and CoT hit the ground-truth answer.
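
The Hits statistic in the Figure 3 caption counts how many of the 10 annotated answers are mentioned in the generated text. A minimal sketch of one way to compute it follows. The function name `hits` and the case-insensitive substring matching rule are assumptions; the caption does not say how a "mention" is detected, and substring matching can over-count short answers that occur inside longer words.

```python
from typing import List


def hits(generated: str, annotated_answers: List[str]) -> int:
    """Count annotated answers mentioned in the generated text.

    Assumed matching rule: case-insensitive substring match.
    Repeated annotations are counted once each, so the result
    ranges from 0 to len(annotated_answers).
    """
    text = generated.lower()
    return sum(
        1
        for answer in annotated_answers
        if answer.strip() and answer.strip().lower() in text
    )


# Ten annotations for one question, as in OK-VQA (values are made up).
annotations = ["pine"] * 6 + ["fir"] * 3 + ["oak"]
knowledge = "Pine trees keep their needles all year."
score = hits(knowledge, annotations)  # 6: only the six "pine" annotations match
```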