VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

Shaojin Wu; Fei Ding; Mengqi Huang; Wei Liu; Qian He

TL;DR

VMix tackles the gap between text fidelity and fine-grained aesthetics in diffusion-based text-to-image generation by disentangling prompts into content and aesthetic descriptions and applying a value-mixed cross-attention mechanism. It introduces an AesEmb aesthetic embedding initialized from opposing label pairs, and a zero-initialized projection to inject aesthetics through a dual-branch cross-attention that preserves the original attention map. Trained with a frozen base model and LoRA, VMix remains plug-and-play and compatible with ControlNet and IP-Adapter, delivering stronger aesthetic quality without sacrificing text alignment. Empirical results on MJHQ-30K and LAION-HQ10K show superior AES scores and competitive FID/CLIP, validating its effectiveness and practical impact for high-quality, controllable diffusion generation.

Abstract

While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
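The abstract and TL;DR describe value-mixed cross-attention only in words, so the following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes a single attention head, that the attention map is computed from the content text embedding alone and reused unchanged for an aesthetic value branch, and that the aesthetic branch passes through a zero-initialized linear layer before being added. The names `value_mixed_cross_attention`, `W_v_aes`, and `W_zero`, and all tensor shapes, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def value_mixed_cross_attention(h, f_t, f_a, W_q, W_k, W_v, W_v_aes, W_zero):
    """Single-head sketch: one attention map, two value branches.

    h    : (n_img, d_h)   image features (queries)
    f_t  : (n_tok, d_txt) content text embedding (keys and content values)
    f_a  : (n_tok, d_txt) aesthetic embedding, same token shape as f_t
    W_zero is the zero-initialized projection (an assumption about placement).
    """
    q = h @ W_q
    k = f_t @ W_k
    # The attention map depends only on the content text, so it is preserved.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    content = attn @ (f_t @ W_v)        # original value branch
    aesthetic = attn @ (f_a @ W_v_aes)  # aesthetic value branch, same map
    return content + aesthetic @ W_zero


# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
n_img, n_tok, d_h, d_txt, d = 16, 8, 32, 24, 32
h = rng.normal(size=(n_img, d_h))
f_t = rng.normal(size=(n_tok, d_txt))
f_a = rng.normal(size=(n_tok, d_txt))
W_q = rng.normal(size=(d_h, d))
W_k = rng.normal(size=(d_txt, d))
W_v = rng.normal(size=(d_txt, d))
W_v_aes = rng.normal(size=(d_txt, d))
W_zero = np.zeros((d, d))  # zero initialization

# Plain cross-attention over the content text, for comparison.
baseline = softmax((h @ W_q) @ (f_t @ W_k).T / np.sqrt(d)) @ (f_t @ W_v)
mixed = value_mixed_cross_attention(h, f_t, f_a, W_q, W_k, W_v, W_v_aes, W_zero)

# With W_zero at its initial value the adapter leaves the base output unchanged.
unchanged_at_init = np.allclose(mixed, baseline)

# Once W_zero is non-zero (e.g. after training), the aesthetic branch contributes.
W_trained = 0.1 * rng.normal(size=(d, d))
mixed_trained = value_mixed_cross_attention(
    h, f_t, f_a, W_q, W_k, W_v, W_v_aes, W_trained
)
changed_after_training = not np.allclose(mixed_trained, baseline)
```

Under these assumptions, the zero-initialized layer makes the adapter an identity at the start of training, which fits the claim that the base model's image-text alignment is preserved while aesthetic conditions are injected.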

Paper Structure (20 sections, 9 equations, 14 figures, 3 tables)

This paper contains 20 sections, 9 equations, 14 figures, 3 tables.

Introduction
Related Work
Text-to-Image Models
Improving Text-to-Image Models
Controlling Text-to-Image Models
Methodology
Preliminary
The Disentanglement Text Prompts
Cross-Attention Mixing Control
Training and Inference
Experiments
Experiments Setting
Qualitative Analyses
Quantitative Evaluations
Ablation Study
...and 5 more sections

Figures (14)

Figure 1: Comparison of text fidelity and visual aesthetics between SDXL podell2023sdxl, DPO wallace2024diffusion, and our VMix. DPO can generate attributes that SDXL fails to produce, but it fails to align with fine-grained human visual preferences. Our method achieves better text fidelity and visual aesthetics simultaneously.
Figure 2: Illustration of VMix. (a) In the initialization stage, pre-defined aesthetic labels are transformed into [CLS] tokens through CLIP to obtain AesEmb, which only needs to be computed once at the beginning of training. (b) In the training stage, a projection layer first maps the input aesthetic description $y_{aes}$ into an embedding $f_a$ of the same token dimension as the content text embedding $f_t$. The text embedding $f_t$ is then integrated into the denoising network through value-mixed cross-attention. (c) In the inference stage, VMix extracts all positive aesthetic embeddings from AesEmb to form the aesthetic input, which is fed into the model together with the content input for the denoising process.
Figure 3: Qualitative comparison with various state-of-the-art methods. All results are based on Stable Diffusion rombach2022high. Our VMix method outperforms others, significantly enhancing the quality of image generation across various fine-grained aesthetic dimensions.
Figure 4: Qualitative comparison with various state-of-the-art methods. All results are based on SDXL podell2023sdxl. Our VMix method outperforms others, significantly enhancing the quality of image generation.
Figure 5: Qualitative results. We compare images generated by VMix-integrated personalized models with those from standard personalized models. On the left are images produced by the personalized model with VMix integration, while on the right are images from the standard personalized model without modifications.
...and 9 more figures
