I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu

TL;DR

ThinkDiff introduces a novel alignment paradigm that transfers multimodal in-context reasoning from vision-language models to diffusion-based image generation by training a lightweight aligner to map VLM features into the shared input space of an LLM decoder, which serves as a proxy for the diffusion decoder during training. It presents two variants, ThinkDiff-LVLM and ThinkDiff-CLIP, leveraging LVLM-generated tokens and CLIP image embeddings respectively, enabling in-context reasoning over interleaved images and text prompts and resulting in markedly improved performance on the CoBSAT benchmark with minimal training resources. The approach avoids the need for large reasoning datasets by using vision-language training as a proxy, yielding robust in-context reasoning, composition of multiple modalities, and even video generation with compatible diffusion backends. This work advances diffusion model capabilities toward multimodal, in-context reasoning, with practical implications for education, design, and creative industries, while stressing the need for safeguards against misuse.

Abstract

This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
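The decoder swap described in the abstract can be illustrated with a minimal interface sketch. This is not the authors' implementation: the `Aligner` class, its linear form, and the two stand-in decoder functions are hypothetical names chosen for illustration. The only point it makes is that one aligner output can feed either decoder, because both are assumed to read the same input feature space.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Aligner:
    """Hypothetical linear aligner: maps VLM features into the shared input space."""
    weight: List[Vector]  # shape (d_shared, d_vlm); an assumption, not the paper's design

    def __call__(self, features: List[Vector]) -> List[Vector]:
        # Plain matrix-vector product applied to each token feature.
        return [[sum(w * x for w, x in zip(row, f)) for row in self.weight]
                for f in features]


def llm_decoder_proxy(tokens: List[Vector]) -> str:
    """Stand-in for the LLM decoder used during training (proxy task)."""
    return f"text logits from {len(tokens)} tokens of dim {len(tokens[0])}"


def diffusion_decoder(tokens: List[Vector]) -> str:
    """Stand-in for the diffusion decoder used at inference."""
    return f"image conditioned on {len(tokens)} tokens of dim {len(tokens[0])}"


def run(aligner: Aligner, vlm_features: List[Vector],
        decoder: Callable[[List[Vector]], str]) -> str:
    # The aligner output is the same in both modes; only the consumer changes.
    return decoder(aligner(vlm_features))


aligner = Aligner(weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])  # d_vlm=3 -> d_shared=2
vlm_features = [[0.5, 1.0, 2.0], [1.5, 0.0, 1.0]]             # two toy token features

train_out = run(aligner, vlm_features, llm_decoder_proxy)
infer_out = run(aligner, vlm_features, diffusion_decoder)
```

In this toy setup only the aligner would be trained; the decoders are treated as fixed consumers of the shared space, which is an assumption made here to keep the sketch small.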

Paper Structure

This paper contains 23 sections, 5 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: (a) Our ThinkDiff reasons over interleaved images (a flying monkey and a flying cat) and text prompts (monkey, cat, and zebra) to generate a logically correct and high-quality image (a flying zebra). The ground truth reasoning answer is provided as a reference for readers. (b) ThinkDiff composes images and texts into a coherent and reasonable image.
  • Figure 2: (a) Reconstruction-based diffusion finetuning integrates image features using a diffusion loss, focusing on pixel-level image reconstruction without reasoning. (b) ThinkDiff aligns a VLM to an LLM decoder by vision-language training on image-caption datasets. In inference (dotted lines), it transfers multimodal in-context reasoning capabilities from the VLM to a diffusion decoder.
  • Figure 3: Several diffusion models share a language encoder with encoder-decoder LLMs, allowing aligning with diffusion decoders through aligning with LLM decoders.
  • Figure 4: (a) In ThinkDiff-LVLM training, the LVLM processes an image and a text to generate text tokens and token features, with some token features randomly masked. Unmasked token features are passed to a trainable aligner network and an LLM decoder, which predicts the masked text tokens under cross-entropy supervision. In inference, the LLM decoder is replaced by a diffusion decoder, enabling in-context reasoning image generation from interleaved images and texts. (b) In ThinkDiff-CLIP training, a CLIP vision model extracts image token features, which are then mapped by a trainable aligner network. A part of the image caption is encoded by the LLM encoder and concatenated with the image tokens. These combined tokens are passed to the LLM decoder to predict the next part of the caption, supervised by cross-entropy loss. In inference, the LLM decoder is replaced by a diffusion decoder, allowing coherent image generation based on multimodal context.
  • Figure 5: 2-shot evaluation results on CoBSAT. The input structure is similar to Figure 1a. Given multimodal inputs, ThinkDiff-LVLM accurately captures both implicit attributes (e.g., wicker material) and explicit attributes (e.g., car), and generates a logically correct image (wicker car). In contrast, methods such as SEED-LLaMA [gemaking], Emu [sun2023generative], and GILL [koh2024generating] produce inaccurate and lower-quality images. The ground truth implicit attribute is highlighted in red for readers' reference. See more results in the Appendix.
  • ...and 9 more figures
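
The masked-token objective described in the Figure 4(a) caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the `mask_ratio` parameter, and its value are hypothetical, and the logits are supplied directly instead of being produced by an aligner and LLM decoder. It shows only the two mechanics the caption names: randomly masking token positions, and averaging cross-entropy over the masked positions.

```python
import math
import random
from typing import List, Sequence


def random_mask(num_tokens: int, mask_ratio: float, rng: random.Random) -> List[bool]:
    """Return True at positions whose token features are hidden (hypothetical ratio)."""
    num_masked = max(1, round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_token_loss(logits_per_position: Sequence[Sequence[float]],
                      target_ids: Sequence[int],
                      mask: Sequence[bool]) -> float:
    """Average cross-entropy over masked positions only."""
    losses = [cross_entropy(logits, target)
              for logits, target, is_masked in zip(logits_per_position, target_ids, mask)
              if is_masked]
    return sum(losses) / len(losses)


rng = random.Random(0)
mask = random_mask(num_tokens=6, mask_ratio=0.5, rng=rng)

# Toy decoder output: uniform logits over a 4-token vocabulary at every position.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(6)]
targets = [0, 1, 2, 3, 0, 1]
loss = masked_token_loss(uniform_logits, targets, mask)
```

With uniform logits over four classes, every masked position contributes a loss of log 4, so the average is log 4 regardless of which positions the mask selects.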