CTRLorALTer: Conditional LoRAdapter for Efficient 0-Shot Control & Altering of T2I Models

Nick Stracke, Stefan Andreas Baumann, Joshua M. Susskind, Miguel Angel Bautista, Björn Ommer

TL;DR

LoRAdapter is an efficient, powerful, and architecture-agnostic approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches.

Abstract

Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to consider detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient, powerful, and architecture-agnostic approach to condition text-to-image diffusion models, which enables fine-grained control conditioning during generation and outperforms recent state-of-the-art approaches.

Paper Structure (35 sections, 8 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: LoRAdapter allows structure and style control of the image generation process of text-to-image models in a zero-shot manner. Our approach enables powerful fine-grained and efficient unified control over both structure and style conditioning using conditional LoRA blocks.
  • Figure 2: Overview of the proposed conditional LoRA block. The original weight matrix $W^{(i)}_0$ is frozen while all other layers are trained. $\phi$ is an affine transformation that operates on the low-dimensional embedding $A^{(i)}x$ and introduces the conditioning. The local mapper network $m_L^{(i)}$ predicts the scale and shift parameters $\beta, \gamma$ for the affine transformation. Typically, we set $m_L^{(i)}$ to be a small network. If mapping the conditioning $c$ requires complex transformations, they are performed in $m_S$, because it is shared across all adapted layers.
  • Figure 3: Visualization of implementations of our conditional LoRAs for specific layers.
  • Figure 4: Samples from our method with style conditioning compared against other methods. We used an empty prompt and only conditioned on the image. We generally perform on par with IP-Adapter and outperform it on some samples. Note that the third image from the left is less degraded, and the third image from the right captures the mane of the horse better.
  • Figure 5: Samples from our method with structural conditioning compared against other methods. Note that the image details produced by our method follow the depth prompt substantially more closely, especially compared with T2I Adapter (see, e.g., the lamp in the background of the living room scene, the side table's legs, or the salad on the pizza).
  • ...and 8 more figures
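
The conditional LoRA block described in the Figure 2 caption can be illustrated with a small NumPy sketch. It is only a toy reading of the caption, not the authors' implementation: the frozen weight, the low-rank factors `A` and `B`, and the affine transformation on the low-dimensional embedding come from the caption, while everything else is an assumption made for illustration. In particular, the class name `ConditionalLoRALinear`, the single-layer mappers, the zero initialisation of `B` (borrowed from standard LoRA), and the choice to centre the scale at 1 are hypothetical. The caption does not say which of $\beta, \gamma$ is the scale and which is the shift, so the code uses the neutral names `scale` and `shift`.

```python
import numpy as np


def shared_mapper(c, W_s):
    """Hypothetical stand-in for m_S: one nonlinear map applied to the raw condition c."""
    return np.tanh(W_s @ c)


class ConditionalLoRALinear:
    """Toy linear layer computing h = W0 x + B phi(A x), with phi(z) = scale * z + shift."""

    def __init__(self, d_in, d_out, rank, d_cond, rng):
        self.rank = rank
        # Pretrained weight W0: kept frozen, never updated by adapter training.
        self.W0 = rng.standard_normal((d_out, d_in))
        # Low-rank factors. B starts at zero (usual LoRA initialisation, assumed
        # here), so the adapted layer initially reproduces the frozen layer.
        self.A = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)
        self.B = np.zeros((d_out, rank))
        # Hypothetical local mapper m_L: a single linear map that emits
        # `rank` scale offsets followed by `rank` shifts.
        self.M = 0.1 * rng.standard_normal((2 * rank, d_cond))

    def forward(self, x, c_emb):
        z = self.A @ x                      # low-dimensional embedding A x
        params = self.M @ c_emb             # m_L output for this layer
        scale = 1.0 + params[: self.rank]   # assumption: scale centred at 1
        shift = params[self.rank:]
        # Frozen path plus conditioned low-rank update.
        return self.W0 @ x + self.B @ (scale * z + shift)


rng = np.random.default_rng(0)
layer = ConditionalLoRALinear(d_in=8, d_out=6, rank=2, d_cond=4, rng=rng)
W_s = rng.standard_normal((4, 3))           # m_S weights, shared by all layers
x = rng.standard_normal(8)
c_emb = shared_mapper(rng.standard_normal(3), W_s)
h = layer.forward(x, c_emb)
print(h.shape)
```

Because the conditioning enters only through a scale and shift of the rank-sized embedding, each adapted layer needs just `2 * rank` extra values from its local mapper, which is one plausible reason the per-layer mapper can stay small while heavier processing of the condition is left to the shared mapper.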