Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao

TL;DR

This work reframes conditional diffusion-based text-to-image generation as a model inversion problem on discriminative Vision-Language Models (VLMs). It introduces a training-free pipeline that optimizes latent representations in a latent diffusion space under fixed VLM supervision (BLIP-2), optionally augmented with Score Distillation Sampling (SDS) to boost fidelity. Key contributions include formalizing VLM inversion for alignment, introducing augmentation-regularized losses to prevent adversarial artifacts, and demonstrating near-SOTA text-image alignment on the T2I-CompBench benchmark. The approach is flexible and data-free, offering a new direction for controllable generation with potential extensions to multi-referee and cross-modal tasks, albeit with trade-offs in stability and spatial reasoning when using a single VLM referee.

Abstract

As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve better text-image alignment. As a proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-CompBench.
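
The structure of the optimization described in the abstract can be illustrated with a toy sketch. Everything in it is a stand-in chosen for illustration and not the authors' implementation: `alignment_grad` plays the role of the gradient of a VLM matching loss, `fidelity_grad` plays the role of an SDS-style fidelity term, and `weight`, `lr`, and `steps` are hypothetical parameters. The only point is that the image variable itself is updated by gradient descent on a weighted sum of two terms, with no sampling process involved.

```python
def alignment_grad(x, target):
    """Gradient of 0.5 * ||x - target||^2.

    Hypothetical stand-in for the gradient of a VLM text-image matching loss.
    """
    return [xi - ti for xi, ti in zip(x, target)]


def fidelity_grad(x, prior):
    """Gradient of 0.5 * ||x - prior||^2.

    Hypothetical stand-in for an SDS-style fidelity term.
    """
    return [xi - pi for xi, pi in zip(x, prior)]


def optimize(x0, target, prior, weight=0.25, lr=0.1, steps=500):
    """Gradient descent directly on x for: alignment + weight * fidelity."""
    x = list(x0)
    for _ in range(steps):
        ga = alignment_grad(x, target)
        gf = fidelity_grad(x, prior)
        x = [xi - lr * (a + weight * f) for xi, a, f in zip(x, ga, gf)]
    return x


# For these quadratic stand-ins the minimizer is
# (target + weight * prior) / (1 + weight).
result = optimize([0.0, 0.0], target=[1.0, 2.0], prior=[0.0, 0.0])
```

In this toy setting, `weight=0` makes the iterate converge to the alignment target alone, while a larger `weight` pulls it toward the fidelity term. This mirrors the balancing the abstract refers to only in spirit; the actual losses are not quadratic.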

Paper Structure

This paper contains 36 sections, 4 theorems, 21 equations, 13 figures, 9 tables, 1 algorithm.

Key Result

Proposition 3.1

Under some mild regularity conditions on the augmentation, $\widetilde{L}_y$ is strictly smoother than $L_y$. In particular, if $A(\mathbf{x}) = \mathbf{x}+\epsilon$ where $\epsilon$ is Gaussian, $\widetilde{L}_y$ is infinitely smooth and Lipschitz continuous.
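
The Gaussian case can be made concrete with a small numerical sketch. It assumes that $\widetilde{L}_y(\mathbf{x})$ denotes the expectation of $L_y(A(\mathbf{x}))$ over the random augmentation $A$, which this summary does not state explicitly. It also replaces the VLM loss with a one-dimensional stand-in, $L(x) = |x|$, whose kink at 0 makes the effect easy to see. The function names and the value of `sigma` are illustrative choices.

```python
import math
import random


def loss(x):
    """Stand-in non-smooth loss with a kink at 0 (not a VLM loss)."""
    return abs(x)


def smoothed_loss(x, sigma=0.5):
    """Closed form of E[|x + eps|] for eps ~ N(0, sigma^2)."""
    z = x / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (2.0 * pdf + z * (2.0 * cdf - 1.0))


def monte_carlo_smoothed_loss(x, sigma=0.5, n=20000, seed=0):
    """Average the loss over n noisy copies of x, as with n augmentations."""
    rng = random.Random(seed)
    return sum(loss(x + rng.gauss(0.0, sigma)) for _ in range(n)) / n


def one_sided_slopes(f, x, h=1e-6):
    """Finite-difference slopes just to the left and right of x."""
    return (f(x) - f(x - h)) / h, (f(x + h) - f(x)) / h


raw_left, raw_right = one_sided_slopes(loss, 0.0)
smooth_left, smooth_right = one_sided_slopes(smoothed_loss, 0.0)
```

For the stand-in loss, the one-sided slopes at 0 disagree (the derivative jumps), whereas for the smoothed version they agree. The derivative of the closed form is $2\Phi(x/\sigma) - 1$, which is bounded by 1 in absolute value, consistent with the Lipschitz claim in this toy case. The Monte Carlo average shows how a finite number of augmentations approximates the expectation.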

Figures (13)

  • Figure 1: Our method generates faithful images that strictly follow the prompt "Two hot dogs sit on a white paper plate near a soda cup which is sitting on a green picnic table while a bike and a silver car are parked nearby". In contrast, the baseline methods, including Stable Diffusion (SD) 1.5, Attend-and-Excite [chefer2023attendandexcite], PixArt-$\alpha$ [chen2023pixart], DALLE-2 [dalle2], and DALLE-3 [betker2023dalle3], struggle to generate correct images for this kind of complex compositional prompt.
  • Figure 2: SOTA image generation models require collecting a large number of images, which are then labeled in detail by a VLM for training. During the inference phase, the condition ($\mathbf{y}$) is injected into the denoising process of random noise ($\mathbf{x}_\mathrm{T}$) through a cross-attention mechanism. The key difference in our approach is that we generate images by directly inverting the VLM, eliminating the need to train a specialized generation model.
  • Figure 3: The illustration of the influence of augmentation regularization on BLIP-2 inversion. We show the result of no augmentation (top row), 15 augmentations (middle row), and 30 augmentations (bottom row).
  • Figure 4: Qualitative comparison using prompts from the attribute-binding category of T2I-CompBench [huang2023t2i]. We generate two images per prompt, using the same two random seeds for all methods.
  • Figure 5: A comparison of images generated by the Stable Diffusion (SD) reverse process with CFG scales of 7.5 and 30, and by the Score Distillation Sampling (SDS) process with a CFG scale of 30.
  • ...and 8 more figures

Theorems & Definitions (7)

  • Proposition 3.1: informal
  • Proposition 3.2: informal
  • Proposition 2.1: Improved smoothness
  • proof
  • Proposition 2.2: Improved condition number of convex optimization
  • Remark 2.3
  • proof