Diffusion Instruction Tuning
Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare
TL;DR
Lavender introduces diffusion instruction tuning, a data-efficient method that aligns vision-language model (VLM) attention with diffusion-model attention to improve multimodal reasoning. The approach formalizes a Bayesian objective, L_total(θ) = L_VLM(θ) + λ L_att(θ), in which the VLM fine-tuning loss L_VLM is combined with an attention-alignment term L_att weighted by λ, and uses a lightweight Aligner to map between VLM and diffusion attention so that fine-tuning needs little data. It supports both cross-attention and self-attention VLMs, uses learned attention aggregation and LoRA-style parameter-efficient training, and reports gains across 20 benchmarks (up to 30% on Llama-3.2-11B-vision-instruct and a 68% boost on WorldMedQA-V) with only 0.13M fine-tuning examples. The results indicate that diffusion-model attention maps provide high-quality, data-efficient priors for grounding visual-text interactions, offering a scalable, architecture-agnostic path to stronger vision-language systems with practical compute requirements.
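To make the combined objective concrete, here is a minimal sketch of L_total = L_VLM + λ L_att in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the alignment term is assumed to be a mean squared error between two attention maps of the same shape, and the names `attention_mse`, `total_loss`, and the default `lam=0.1` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def attention_mse(vlm_attn: Matrix, target_attn: Matrix) -> float:
    """Mean squared error between two equally shaped attention maps.

    Assumption: L_att is an element-wise MSE. The summary above only says
    the two attention maps are aligned, not which distance is used.
    """
    if len(vlm_attn) != len(target_attn):
        raise ValueError("attention maps must have the same number of rows")
    total, count = 0.0, 0
    for row_v, row_t in zip(vlm_attn, target_attn):
        if len(row_v) != len(row_t):
            raise ValueError("attention maps must have the same row lengths")
        for v, t in zip(row_v, row_t):
            total += (v - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("attention maps must not be empty")
    return total / count


def total_loss(vlm_loss: float, vlm_attn: Matrix,
               diffusion_attn: Matrix, lam: float = 0.1) -> float:
    """Combine the task loss and the alignment term: L_VLM + lam * L_att.

    `lam` plays the role of λ; its default here is an arbitrary placeholder.
    """
    return vlm_loss + lam * attention_mse(vlm_attn, diffusion_attn)


# Toy example: a 1x2 attention map that differs from the target in one cell.
loss = total_loss(2.0, [[0.0, 1.0]], [[1.0, 1.0]], lam=0.5)
```

In an actual training loop the two maps would be tensors produced by the VLM and by the diffusion model (after any aggregation and the Aligner), and the sum would be backpropagated through the VLM parameters θ only.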
Abstract
We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.
