WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha

TL;DR

WeMMU addresses the fragile task generalization that arises when bridging vision-language models with diffusion generators. It replaces fixed learnable query tokens with Noisy Query Tokens that are sampled from a distribution, which supports robust continual learning. A VAE-based detail path preserves high-frequency image information without overburdening the diffusion backbone, and a Position MLP supplies spatial conditioning. The approach aims for deep semantic alignment together with high editing fidelity, and a four-stage curriculum lets the model adapt stably to increasingly complex tasks, including multi-image editing, with minimal forgetting of prior capabilities. On generation and editing benchmarks, the reported results are competitive with or better than state-of-the-art unified approaches, which supports the architecture's division of labor between understanding and generation.

Abstract

Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
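
The difference between a fixed set of query tokens and Noisy Query Tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token count, and the token width are assumptions made for illustration, and the only detail taken from the paper is that the noisy tokens are drawn from a standard normal distribution at each step.

```python
import random


def make_fixed_queries(num_tokens, dim, seed=0):
    """Baseline: a single set of query vectors reused at every step.

    In a real model these would be learnable parameters; here they are
    only initialized once with small random values (hypothetical setup).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]


def sample_noisy_queries(num_tokens, dim, rng):
    """Draw fresh query vectors with every entry i.i.d. from N(0, 1), i.e. N(0, I)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


rng = random.Random(42)
fixed = make_fixed_queries(num_tokens=4, dim=8)
step_a = sample_noisy_queries(num_tokens=4, dim=8, rng=rng)  # one training step
step_b = sample_noisy_queries(num_tokens=4, dim=8, rng=rng)  # the next step

# Compare: the fixed queries are rebuilt identically, the noisy ones are redrawn.
print(make_fixed_queries(4, 8) == fixed)
print(step_a == step_b)
```

Because the query inputs change at every step, whatever bridges the VLM and the diffusion model has to work for a whole region of query space rather than for a single memorized point, which is one plausible reading of the "distributed representation space" described above.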

Paper Structure

This paper contains 47 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Task Generalization Collapse. Sequential training (middle) fails at editing and merely reconstructs the input. Joint training (right) works but is unsustainable because each new task requires full retraining.
  • Figure 1: A gallery of diverse text-to-image generation results from our 'WeMMU' model, synthesized at 1024x1024 resolution.
  • Figure 2: Overall framework. We bridge a frozen VLM and a tunable Diffusion Model via Noisy Query Tokens, sampled from $\mathcal{N}(0, \mathbf{I})$ at each step. These tokens aggregate image and text features in a parallel generation pathway, while the original VLM pathway remains frozen. A VAE branch injects fine-grained details via a linear layer. A Position MLP adds 2D spatial cues and projects the features to condition the diffusion model. This design maintains a clear division of labor: the VLM handles understanding, and the diffusion model focuses on generation.
  • Figure 3: Analysis of query token attention mechanisms. The prompt is "Remove the 'MILLER MOTORCARS' text positioned across the top center of the image". (Bottom rows) Learnable queries show a strong attention bias towards image tokens. Our Noisy Queries shift focus to the text tokens (right of red line), prioritizing instruction following. The VAE branch (right of blue line) helps balance this attention.
  • Figure 4: Justifying the VAE Branch design. (Left) Fine-tuning the native ViT of Qwen2.5-VL leads to training collapse. (Right) Loss curves for different VAE branch connection methods, showing a simple Linear layer (the red line) provides the fastest and most stable convergence.
  • ...and 1 more figures
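
The conditioning path described in the Figure 2 caption (a linear layer on VAE latents plus a Position MLP for 2D spatial cues) could be sketched roughly as follows. This is one possible reading under stated assumptions, not the paper's architecture: the dimensions, the normalized grid coordinates, the two-layer ReLU MLP, and the choice to add the positional feature to the projected latent are all hypothetical.

```python
import random


def init_linear(in_dim, out_dim, rng):
    """Return (weight, bias) for y = W x + b, with weight shaped (out_dim, in_dim)."""
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, params):
    """Apply a linear layer to a single vector."""
    weight, bias = params
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def grid_positions(height, width):
    """Normalized (row, col) coordinates in [0, 1], in row-major order."""
    return [[r / max(height - 1, 1), c / max(width - 1, 1)]
            for r in range(height) for c in range(width)]


def position_mlp(pos, layer1, layer2):
    """Two-layer MLP with ReLU mapping a 2D coordinate to a feature vector."""
    hidden = [max(0.0, h) for h in linear(pos, layer1)]
    return linear(hidden, layer2)


rng = random.Random(0)
latent_dim, cond_dim, height, width = 4, 8, 2, 3  # illustrative sizes only

# Stand-in VAE latents: one latent_dim vector per spatial location.
latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
           for _ in range(height * width)]

vae_proj = init_linear(latent_dim, cond_dim, rng)  # the single linear layer
pos_l1 = init_linear(2, 16, rng)                   # assumed hidden width of 16
pos_l2 = init_linear(16, cond_dim, rng)

cond_tokens = []
for latent, pos in zip(latents, grid_positions(height, width)):
    detail = linear(latent, vae_proj)            # fine-grained detail feature
    spatial = position_mlp(pos, pos_l1, pos_l2)  # 2D spatial cue
    cond_tokens.append([d + s for d, s in zip(detail, spatial)])
```

Each entry of `cond_tokens` is a per-location vector that a diffusion model could consume as conditioning. Keeping the VAE projection to a single linear layer mirrors the Figure 4 caption, which reports that a simple Linear connection converged fastest and most stably among the variants compared.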