Diffusion Instruction Tuning

Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare

TL;DR

Lavender introduces diffusion instruction tuning, a data-efficient method that aligns vision-language model (VLM) attention with diffusion-model attention to improve multimodal reasoning. The approach formalizes a Bayesian objective, $L_{total}(\theta) = L_{VLM}(\theta) + \lambda L_{att}(\theta)$, and uses a lightweight Aligner that transforms VLM attention to match diffusion-model attention, so fine-tuning needs little data. It supports both cross-attention and self-attention VLMs, uses learned attention aggregation and LoRA-style parameter-efficient training, and reports gains across 20 benchmarks (up to 30% on Llama-3.2-11B-vision-instruct and a 68% boost on WorldMedQA-V) with only 0.13M fine-tuning examples. The results indicate that diffusion-model attention maps are high-quality, data-efficient priors for grounding visual-text interactions, offering a scalable, architecture-agnostic path to stronger vision-language systems with practical compute requirements.
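As a rough illustration of the combined objective above, the following sketch computes $L_{total} = L_{VLM} + \lambda L_{att}$ for one word's attention map, taking $L_{att}$ as a mean squared error as described in the Figure 2 caption. The function names, the toy 2x2 maps, and the value `lam=0.5` are assumptions made for this example and are not taken from the paper; the Aligner network itself is not modelled, and `attn_vlm_aligned` stands in for its output.

```python
def mse(a, b):
    """Mean squared error between two equally shaped 2-D maps (lists of rows)."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("attention maps must have the same number of entries")
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def total_loss(loss_vlm, attn_vlm_aligned, attn_sdm, lam=0.5):
    """Return (L_total, L_att) with L_total = L_VLM + lam * L_att.

    lam is a hypothetical weighting; the source does not state its value here.
    """
    loss_att = mse(attn_vlm_aligned, attn_sdm)
    return loss_vlm + lam * loss_att, loss_att


# Toy 2x2 attention maps for a single word (made-up values).
attn_sdm = [[0.0, 1.0], [0.0, 0.0]]          # diffusion-model target map
attn_vlm_aligned = [[0.5, 0.5], [0.0, 0.0]]  # stand-in for the Aligner output

loss, loss_att = total_loss(2.0, attn_vlm_aligned, attn_sdm, lam=0.5)
print(loss_att, loss)
```

The attention term acts only as a regulariser: with `lam=0` the objective reduces to the ordinary supervised fine-tuning loss.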

Abstract

We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.

Paper Structure

This paper contains 81 sections, 44 equations, 30 figures, 3 tables, 2 algorithms.

Figures (30)

  • Figure 1: Average Performance on 20 Vision-Language Reasoning Benchmarks (Grouped into 4 Categories).
  • Figure 2: Lavender: Diffusion Instruction Tuning. Lavender uses the text-vision attention maps of a Stable Diffusion Model, $Attention_{SDM}$, as a guiding objective for the attention of the target vision-language model (VLM), $Attention_{VLM}$. The Attention Alignment module employs a 3-Layer ConvNet to transform $Attention_{VLM}$ to match $Attention_{SDM}$ via an MSE loss, acting as a regularisation term during supervised fine-tuning.
  • Figure 3: Image generation models (Stable Diffusion on the left) exhibit stronger word-to-region attention alignment than VLMs (Open-Flamingo on the right). Per-word average attention maps suggest that diffusion models may be closer to an ideal distribution correlating image regions with textual tokens.
  • Figure 4: Sketch of Diffusion Instruction Tuning (left) and short pseudocode (right); the full version is available in Appendix \ref{sec:full_pseudo_code}.
  • Figure 5: Illustration of attention aggregation in VLMs. Attention weights between text tokens and image patches are aggregated to form per-word saliency maps.
  • ...and 25 more figures
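
The aggregation described in the Figure 5 caption can be sketched as follows. This is a minimal example under stated assumptions: it averages attention weights uniformly over heads, whereas the TL;DR mentions learned aggregation, so the plain mean is a stand-in. The function name `per_word_saliency`, the `attn[head][token][patch]` layout, and the toy values are hypothetical.

```python
def per_word_saliency(attn, grid_h, grid_w):
    """Aggregate text-to-patch attention into one saliency map per text token.

    attn[h][t][p] is the weight from text token t to image patch p in head h
    (an assumed layout). Returns maps[t] as grid_h rows of grid_w values,
    averaged uniformly over heads.
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])
    n_patches = grid_h * grid_w
    maps = []
    for t in range(n_tokens):
        # Uniform mean over heads; a learned weighting would replace this step.
        mean = [
            sum(attn[h][t][p] for h in range(n_heads)) / n_heads
            for p in range(n_patches)
        ]
        # Reshape the flat patch vector into a 2-D grid, row by row.
        maps.append([mean[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)])
    return maps


# Two heads, one text token, four patches on a 2x2 grid (made-up values).
attn = [
    [[1.0, 0.0, 0.0, 0.0]],  # head 0
    [[0.0, 1.0, 0.0, 0.0]],  # head 1
]
saliency = per_word_saliency(attn, grid_h=2, grid_w=2)
print(saliency[0])
```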