
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou

TL;DR

X-Adapter addresses the plugin incompatibility problem that arises when upgrading large diffusion models by introducing a universal adapter that sits between the base plugin connectors and the upgraded model. It freezes both the base and upgraded UNets and learns lightweight per-layer feature-mapping adapters to guide the upgraded model, with training performed in a plugin-free setting using dual latent streams and a null-text strategy, plus a two-stage, SDEdit-inspired inference to align latents. The approach enables universal compatibility and remix of plugins across model versions (e.g., ControlNet from the base and LoRA from the upgraded model) with empirical validation on common plugins. This work reduces maintenance overhead during model upgrades and broadens cross-version plugin applicability, benefiting the diffusion community and downstream users.

Abstract

We introduce X-Adapter, a universal upgrader that enables pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with an upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this by training an additional network that controls the frozen upgraded model, using new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders of the two model versions for feature remapping. The remapped features are used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to these strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionality available to the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments, and the results show that X-Adapter may facilitate wider application of the upgraded foundational diffusion model.
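
The feature remapping described in the abstract can be illustrated with a minimal sketch. It assumes that a mapping layer is a plain linear projection from the base model's channel size to the upgraded model's, and that guidance means adding the remapped feature to the upgraded decoder's feature with a scalar strength. Neither detail is stated in the abstract, and the names `map_features`, `guide`, and `strength` are hypothetical, chosen only for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def map_features(base_feat: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Project a base-model decoder feature to the upgraded model's channel size.

    Assumption: the trainable mapping layer is a linear map (weight, bias).
    """
    return [
        sum(w * x for w, x in zip(row, base_feat)) + b
        for row, b in zip(weight, bias)
    ]


def guide(upgraded_feat: Vector, mapped_feat: Vector, strength: float = 1.0) -> Vector:
    """Add the remapped feature to the frozen upgraded model's decoder feature.

    Assumption: guidance is additive and scaled by a scalar strength.
    """
    return [u + strength * m for u, m in zip(upgraded_feat, mapped_feat)]


# Toy example: the base decoder has 2 channels, the upgraded decoder has 3.
base_feat = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2 trainable matrix
bias = [0.0, 0.0, 0.5]

mapped = map_features(base_feat, weight, bias)
guided = guide([0.0, 1.0, 2.0], mapped, strength=0.5)
```

In this sketch only `weight` and `bias` would be trained; the features themselves come from the two frozen models, which is what lets existing plugins stay attached to the base model unchanged.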

Paper Structure

This paper contains 6 sections, 4 equations, 3 figures.

Figures (3)

  • Figure 1: Given pretrained plug-and-play modules (e.g., ControlNet, LoRA) of the base diffusion model (e.g., Stable Diffusion 1.5), the proposed X-Adapter can universally upgrade these plugins, enabling them to work directly with the upgraded model (e.g., SDXL) without further retraining. Text prompts, from left to right and top to bottom: "1girl, solo, smile, looking at viewer, holding flowers"; "Apply face paint"; "1girl, upper body, flowers"; "A colorful lotus, ink"; "Best quality, extremely detailed"; "A fox made of water".
  • Figure 2: Task Definition. Unlike previous methods, which train each plugin individually, our method trains a single X-Adapter that serves all the fixed downstream plugins.
  • Figure 3: Method Overview. In training, we add different noises to the upgraded model and to X-Adapter, each in its own latent domain (upgraded and base, respectively). By setting the prompt of the upgraded model to empty and training the mapping layers, X-Adapter learns to guide the upgraded model. In testing, (a) we can directly apply the plugins to X-Adapter for the upgraded model. (b) A two-stage inference scheme is introduced to improve image quality.
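
The two-stage inference scheme in Figure 3(b) can be pictured as a split of the denoising schedule. The sketch below is illustrative only. It assumes the first stage covers the noisiest timesteps and the second stage covers the rest, and that the switch happens at a single threshold `t0`. The function name `two_stage_schedule`, the threshold value, and the ten-step schedule are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def two_stage_schedule(timesteps: List[int], t0: int) -> Tuple[List[int], List[int]]:
    """Split a descending timestep list at the threshold t0.

    Assumption: stage one handles timesteps >= t0 and stage two handles
    timesteps < t0; how each stage uses the two models is left out here.
    """
    stage_one = [t for t in timesteps if t >= t0]
    stage_two = [t for t in timesteps if t < t0]
    return stage_one, stage_two


# Toy schedule with ten denoising steps, from t = 900 down to t = 0.
timesteps = list(range(900, -1, -100))
stage_one, stage_two = two_stage_schedule(timesteps, t0=700)
```

With this toy schedule the first stage receives the three noisiest steps and the second stage the remaining seven. Moving `t0` trades off how much of the trajectory is spent before the latents are aligned, which is the knob an SDEdit-style scheme would expose.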