SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces

Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu

TL;DR

This work aims to make reasoning in vision-language models (VLMs) more robust through a self-distillation framework: a prompt library elicits diverse, two-step reasoning traces from the model, and those traces are then used to fine-tune the same VLM. Three architectural additions (an intervention adapter, a cross-modal skip connection, and ensemble weighting) let the model learn from these traces while keeping the number of updated parameters small. The approach yields consistent improvements across five VQA benchmarks, with gains in ANLS and accuracy over strong baselines. Because the model internalizes the reasoning traces without full-model retraining, the method offers a scalable way to improve VLM reasoning on tasks that require complex cross-modal inference.

Abstract

Reasoning is increasingly crucial for various tasks. While chain-of-thought prompting enables large language models to leverage reasoning effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains challenging. To solve this problem, we propose a novel self-distillation framework that enhances the reasoning capabilities of the model. The proposed framework introduces several key innovations. We start by employing a prompt library tailored to visual reasoning tasks to generate diverse in-context questions and utilize a two-step reasoning procedure to derive reasoning-guided responses. These responses are then used for self-distillation, enabling the model to internalize the reasoning process. Additionally, we improve the model architecture with several innovative components, including an intervention adapter for efficient parameter updates, a cross-modal skip connection to facilitate information exchange between modalities, and an ensemble learning algorithm to integrate diverse reasoning from multiple in-context questions. Extensive experiments show that our method significantly improves the baseline performance across five VQA datasets.
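
The data-generation side of the pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `VLM` callable interface, the contents of `PROMPT_LIBRARY`, the `DistillationExample` container, and the way the step-one output is concatenated into the step-two prompt are all hypothetical. The adapter, skip connection, and ensemble components are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interface: (image, prompt) -> generated text.
VLM = Callable[[str, str], str]

# Hypothetical prompt library; the actual prompts used in the paper are not shown here.
PROMPT_LIBRARY = [
    "Describe the regions of the image relevant to: {question}",
    "List the text and numbers in the image needed to answer: {question}",
    "Explain step by step how the image could be used to answer: {question}",
]


@dataclass
class DistillationExample:
    image: str
    question: str
    target: str  # reasoning-guided response used as the training target


def build_distillation_set(model: VLM, image: str, question: str, n: int) -> List[DistillationExample]:
    """Collect up to n reasoning-guided responses for one (image, question) pair."""
    examples = []
    for template in PROMPT_LIBRARY[:n]:
        in_context_question = template.format(question=question)
        # Step 1: elicit a reasoning trace with an in-context question.
        reasoning = model(image, in_context_question)
        # Step 2: answer the original question conditioned on that trace
        # (the prompt layout here is an assumption).
        answer = model(image, f"{reasoning}\nQuestion: {question}\nAnswer:")
        # The target is paired with the plain question, so a model fine-tuned on it
        # must produce the reasoning-guided response without the extra prompts.
        examples.append(DistillationExample(image, question, answer))
    return examples
```

Pairing each target with the original question alone is one way to read "internalize the reasoning process": the fine-tuned model sees only the question at inference time, yet is trained toward responses that were produced with reasoning in context.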

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The comparison of existing work and our work. Instead of solely relying on reasoning capabilities of pre-trained models, our work proposes a self-distillation pipeline to integrate in-depth reasoning traces into the model.
  • Figure 2: The overall framework. We first introduce a prompt library to generate $N$ in-context question pairs. These questions are then fed into the model in two consecutive steps to produce diverse reasoning-guided responses in parallel. The resulting responses are used for self-distillation, enabling the model to better internalize the reasoning process. To help the model capture reasoning effectively, we add several components to the model architecture: an intervention adapter for efficient parameter updates, a cross-modal skip connection to facilitate information exchange across modalities, and an ensemble learning algorithm to synthesize reasoning derived from multiple in-context questions.
  • Figure 3: An example from the test split of the InfographicVQA dataset. The question is "What is the total number of infrastructures in Zhangjiakou that need to be upgraded?"
  • Figure 4: Category-wise analysis on the ChartQA dataset. The data comprise four question categories: data retrieval, compositional, visual, and both visual and compositional.
  • Figure 5: Effect of number of ensemble members in the self-distillation framework on DocVQA (top) and InfographicVQA (bottom) results.
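
The TL;DR reports gains in ANLS (Average Normalized Levenshtein Similarity), the standard metric for DocVQA and InfographicVQA. For readers unfamiliar with it, the sketch below shows how the metric is conventionally computed: for each question, take the best similarity against any reference answer, zero it out when the normalized edit distance reaches the threshold (0.5 by convention), and average over questions. The function names and the lowercase and whitespace normalization are assumptions of this sketch, and it is not taken from the paper's evaluation code.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        # Assumed normalization: lowercase and collapse whitespace.
        pred_norm = " ".join(pred.lower().split())
        best = 0.0
        for ref in refs:
            ref_norm = " ".join(ref.lower().split())
            longest = max(len(pred_norm), len(ref_norm))
            distance = levenshtein(pred_norm, ref_norm) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold lets small OCR-style slips earn partial credit while a substantially wrong answer scores zero, which is why ANLS is preferred over exact-match accuracy on document and infographic benchmarks.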