SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu
TL;DR
This work addresses the difficulty of eliciting robust reasoning from vision-language models (VLMs). It introduces a self-distillation framework that uses a prompt library and a two-step reasoning procedure to extract diverse reasoning traces, then fine-tunes the VLM on them. Three architectural additions (an intervention adapter, a cross-modal skip connection, and ensemble weighting) let the model learn from these traces while remaining parameter-efficient. The approach yields consistent improvements across five VQA benchmarks, with ANLS and accuracy gains over strong baselines. Because it internalizes diverse reasoning traces without retraining the full model, the method offers a scalable way to improve VLM reasoning on tasks that require complex cross-modal inference.
Abstract
Reasoning is increasingly important across a wide range of tasks. While chain-of-thought prompting enables large language models to reason effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains challenging. To address this, we propose a novel self-distillation framework that enhances the model's reasoning capabilities. We first employ a prompt library tailored to visual reasoning tasks to generate diverse in-context questions, then use a two-step reasoning procedure to derive reasoning-guided responses. These responses are used for self-distillation, enabling the model to internalize the reasoning process. We also extend the model architecture with several components, including an intervention adapter for efficient parameter updates, a cross-modal skip connection to facilitate information exchange between modalities, and an ensemble learning algorithm to integrate diverse reasoning from multiple in-context questions. Extensive experiments show that our method significantly improves over the baseline across five VQA datasets.
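The data-generation part of the framework (prompt library, two-step reasoning, self-distillation targets) can be illustrated with a minimal sketch. Everything named here is hypothetical: the templates in `PROMPT_LIBRARY`, the `Model` callable, `build_distillation_set`, and `stub_model` are illustrative stand-ins, not the authors' prompts or code. The sketch also assumes that the distillation input omits the reasoning trace, which is one plausible reading of "internalize the reasoning process" rather than a confirmed detail.

```python
from typing import Callable, Dict, List

# Hypothetical prompt library: templates that turn one question into several
# in-context questions. These are NOT the prompts used in the paper.
PROMPT_LIBRARY: List[str] = [
    "Describe the relevant parts of the image, then consider: {question}",
    "List the facts in the image needed to answer: {question}",
    "Break the problem into smaller steps: {question}",
]

# A model is any callable mapping (image, text prompt) -> generated text.
Model = Callable[[object, str], str]


def build_distillation_set(
    model: Model, image: object, question: str
) -> List[Dict[str, str]]:
    """Collect reasoning-guided answers to use as self-distillation targets."""
    examples: List[Dict[str, str]] = []
    for template in PROMPT_LIBRARY:
        in_context_question = template.format(question=question)
        # Step 1: elicit a reasoning trace for this in-context question.
        reasoning = model(image, in_context_question)
        # Step 2: answer the original question conditioned on that reasoning.
        answer = model(
            image, f"Reasoning: {reasoning}\nQuestion: {question}\nAnswer:"
        )
        # Assumption: the training input is the bare question, so fine-tuning
        # on (input, target) teaches the model to reach the reasoning-guided
        # answer without the explicit trace in its prompt.
        examples.append(
            {"input": question, "reasoning": reasoning, "target": answer}
        )
    return examples


def stub_model(image: object, prompt: str) -> str:
    """Stand-in for a real VLM so the sketch runs without any weights."""
    if prompt.endswith("Answer:"):
        return "4"
    return "There are four chairs around the table."


if __name__ == "__main__":
    for example in build_distillation_set(
        stub_model, None, "How many chairs are there?"
    ):
        print(example)
```

With a real VLM in place of `stub_model`, each template would generally produce a different trace and possibly a different answer, which is the diversity the ensemble step is meant to integrate.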
