WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha
TL;DR
WeMMU addresses a generalization weakness in methods that bridge vision-language models with diffusion generators: fixed learnable query tokens adapt poorly to new tasks. It replaces them with Noisy Query Tokens sampled from a distribution, which supports robust continual learning. A VAE-based detail path preserves high-frequency image information without overburdening the diffusion backbone, and a Position MLP enforces spatial conditioning. The approach combines deep semantic alignment with high editing fidelity, and a four-stage curriculum provides stable adaptation to increasingly complex tasks, including multi-image editing, with minimal forgetting of prior capabilities. On generation and editing benchmarks, it performs competitively with or better than state-of-the-art unified approaches, supporting the efficiency and robustness of the meta-architecture's division of labor.
Abstract
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
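
To make the contrast between fixed and noisy query tokens concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of "sampling from a distribution": each query token is a learnable mean vector plus isotropic Gaussian noise. The names `fixed_query_tokens`, `noisy_query_tokens`, `mean`, and `sigma` are hypothetical, and the paper's actual parameterization may differ.

```python
import random


def fixed_query_tokens(table):
    """Baseline: return a copy of the same learnable vectors on every call."""
    return [row[:] for row in table]


def noisy_query_tokens(mean, sigma, rng):
    """Sample tokens as mean + sigma * eps with eps ~ N(0, 1).

    Assumption for illustration only: the source does not specify the
    distribution or how its parameters are learned.
    """
    return [[m + sigma * rng.gauss(0.0, 1.0) for m in row] for row in mean]


# Hypothetical sizes: 4 query tokens, each of dimension 8.
num_tokens, dim = 4, 8
rng = random.Random(0)
mean = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

# Two draws differ, so the downstream generator is conditioned on a
# neighbourhood of query vectors rather than on a single fixed point.
sample_a = noisy_query_tokens(mean, 0.1, rng)
sample_b = noisy_query_tokens(mean, 0.1, rng)
```

Under this assumption, setting `sigma` to zero recovers the fixed-token baseline, which makes the two schemes easy to compare in an ablation.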
