Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

Xuantong Liu; Tianyang Hu; Wenjia Wang; Kenji Kawaguchi; Yuan Yao

TL;DR

This work reframes conditional diffusion-based text-to-image generation as a model inversion problem on discriminative Vision-Language Models (VLMs). It introduces a training-free pipeline that optimizes latent representations in a latent diffusion space under fixed VLM supervision (BLIP-2), optionally augmented with Score Distillation Sampling (SDS) to boost fidelity. Key contributions include formalizing VLM inversion for alignment, introducing augmentation-regularized losses to prevent adversarial artifacts, and demonstrating near-SOTA text-image alignment on the T2I-CompBench benchmark. The approach is flexible and data-free, offering a new direction for controllable generation with potential extensions to multi-referee and cross-modal tasks, albeit with trade-offs in stability and spatial reasoning when using a single VLM referee.

Abstract

As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images under the supervision of discriminative VLMs, the proposed method can potentially achieve better text-image alignment. As a proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-CompBench.
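The idea of generating by optimizing an image under a fixed referee can be sketched on a toy problem. The snippet below is a minimal illustration only, not the authors' pipeline: the linear `D` (decoder) and `E` (image encoder), the vector `text_emb`, the squared-distance loss, and the additive-noise augmentation are all hypothetical stand-ins for the latent decoder, BLIP-2, and the augmentations used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (assumptions, not BLIP-2 or the Stable Diffusion VAE):
D = rng.normal(size=(16, 4)) / 4.0   # fixed "decoder": latent z -> image x
E = rng.normal(size=(8, 16)) / 4.0   # fixed "referee" image encoder: x -> embedding
text_emb = rng.normal(size=8)        # pretend embedding of the prompt y


def referee_loss_and_grad(z, n_aug=8, sigma=0.05):
    """Augmentation-averaged squared-distance loss and its gradient w.r.t. z."""
    loss, grad = 0.0, np.zeros_like(z)
    for _ in range(n_aug):
        # Decode the latent, then apply an additive-noise augmentation.
        x = D @ z + sigma * rng.normal(size=16)
        r = E @ x - text_emb          # mismatch between image and text embeddings
        loss += 0.5 * float(r @ r)
        grad += D.T @ (E.T @ r)       # chain rule through encoder and decoder
    return loss / n_aug, grad / n_aug


def invert(steps=500, lr=0.3):
    """Gradient descent on the latent only; the referee and decoder stay frozen."""
    z = rng.normal(size=4)
    history = []
    for _ in range(steps):
        loss, grad = referee_loss_and_grad(z)
        history.append(loss)
        z = z - lr * grad
    return z, history
```

Calling `z, history = invert()` returns the optimized latent and the loss trace. No model weights are updated at any point, which is what makes this kind of approach training-free.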

Paper Structure

This paper contains 36 sections, 4 theorems, 21 equations, 13 figures, 9 tables, 1 algorithm.

Introduction
Preliminary
Discriminative Vision Language Models
Latent Diffusion Models
Score Distillation
Conditional generation via model inversion
Vision Language Model Inversion for Alignment
Problem formulation
Parameterization
Necessity of Regularization
Ideal Augmentations for BLIP-2 Inversion
SDS for Improved Fidelity
Delicate Balance
Experiments
Quantitative Results
...and 21 more sections

Key Result

Proposition 3.1

Under some mild regularity conditions on the augmentation $A$, $\widetilde{L}_y$ is strictly smoother than $L_y$. In particular, if $A(\mathbf{x}) = \mathbf{x}+\epsilon$ where $\epsilon$ is Gaussian, $\widetilde{L}_y$ is infinitely differentiable and Lipschitz continuous.
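A one-dimensional toy case makes the Gaussian statement concrete. The sketch assumes that $\widetilde{L}_y$ is the augmentation-averaged loss $\mathbb{E}_\epsilon[L_y(\mathbf{x}+\epsilon)]$ (this summary does not spell out the definition) and uses $L(x) = |x|$ as a hypothetical non-smooth loss. For $\epsilon \sim \mathcal{N}(0, \sigma^2)$ the average has a closed form, the mean of a folded normal, whose derivative is an error function: smooth everywhere, including at the kink, and bounded by 1.

```python
import math
import numpy as np


def smoothed_abs_mc(x, sigma=0.5, n=200_000, seed=0):
    """Monte Carlo estimate of E|x + eps| with eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=sigma, size=n)
    return float(np.mean(np.abs(x + eps)))


def smoothed_abs_exact(x, sigma=0.5):
    """Closed form of E|x + eps| (mean of a folded normal distribution)."""
    gauss_term = sigma * math.sqrt(2.0 / math.pi) * math.exp(-x**2 / (2.0 * sigma**2))
    return gauss_term + x * math.erf(x / (sigma * math.sqrt(2.0)))


def smoothed_abs_grad(x, sigma=0.5):
    """Derivative of the smoothed loss; continuous at 0 and bounded by 1."""
    return math.erf(x / (sigma * math.sqrt(2.0)))
```

The original $|x|$ has no derivative at $0$, while `smoothed_abs_grad(0.0)` is well defined and equals zero. Averaging over several sampled augmentations, as in `smoothed_abs_mc`, approximates this smoothed objective.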

Figures (13)

Figure 1: Our method can effectively generate faithful images that strictly follow the prompt "Two hot dogs sit on a white paper plate near a soda cup which is sitting on a green picnic table while a bike and a silver car are parked nearby". In contrast, the baseline methods, including Stable Diffusion (SD) 1.5, Attend-and-Excite [chefer2023attendandexcite], PixArt-$\alpha$ [chen2023pixart], DALLE-2 [dalle2], and DALLE-3 [betker2023dalle3], struggle to generate correct images for this kind of complex compositional prompt.
Figure 2: SOTA image generation models require collecting a large number of images, which are then labeled in detail by a VLM for training. During the inference phase, the condition ($\mathbf{y}$) is injected into the denoising process of random noise ($\mathbf{x}_\mathrm{T}$) through a cross-attention mechanism. The key difference in our approach is that we generate images by directly inverting the VLM, eliminating the need to train a specialized generation model.
Figure 3: Influence of augmentation regularization on BLIP-2 inversion. We show results with no augmentation (top row), 15 augmentations (middle row), and 30 augmentations (bottom row).
Figure 4: Qualitative comparison using prompts from the Attribute-binding category of T2I-CompBench [huang2023t2i]. We generate two images for each prompt with the same two random seeds for all methods.
Figure 5: A comparison of the images generated from the Stable Diffusion (SD) reverse process with CFG scale = 7.5 and 30, and from the Score Distillation Sampling (SDS) process with CFG scale = 30.
...and 8 more figures
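For readers unfamiliar with the CFG scale mentioned in the Figure 5 caption, the following is a minimal sketch of the standard classifier-free guidance combination and the usual SDS residual. The function names are hypothetical, the noise predictions are plain arrays rather than outputs of Stable Diffusion, and the exact weighting used in the paper is not given in this summary.

```python
import numpy as np


def cfg_noise(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one by a factor `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def sds_residual(eps_pred, eps, weight=1.0):
    """Usual SDS update direction on the noised image: weight * (eps_pred - eps).
    `weight` stands in for the timestep weighting w(t), assumed constant here."""
    return weight * (eps_pred - eps)
```

With `scale=1.0` the combination reduces to the conditional prediction, while larger values such as 7.5 or 30 push further along the conditional direction.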

Theorems & Definitions (7)

Proposition 3.1: informal
Proposition 3.2: informal
Proposition 2.1: Improved smoothness
proof
Proposition 2.2: Improved condition number of convex optimization
Remark 2.3
proof
