Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao
TL;DR
This work reframes conditional text-to-image generation with diffusion models as a model inversion problem on discriminative Vision-Language Models (VLMs). It introduces a training-free pipeline that optimizes latent representations in the latent space of a latent diffusion model under supervision from a fixed VLM (BLIP-2), optionally augmented with Score Distillation Sampling (SDS) to boost fidelity. Key contributions include formalizing VLM inversion for alignment, introducing augmentation-regularized losses to prevent adversarial artifacts, and demonstrating near-SOTA text-image alignment on the T2I-CompBench benchmark. The approach is flexible and data-free, and it offers a new direction for controllable generation with potential extensions to multi-referee and cross-modal tasks. The trade-offs are reduced stability and weaker spatial reasoning when a single VLM serves as the referee.
Abstract
As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-CompBench.
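The optimization loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical. A fixed linear map `W` stands in for the VLM referee (BLIP-2 in the paper), additive Gaussian noise stands in for image augmentations, and a quadratic pull toward `prior_mean` stands in for the SDS fidelity term. The sketch only shows how an alignment gradient and a fidelity gradient might be balanced while a latent is optimized directly, without any diffusion sampling.

```python
import numpy as np

def alignment_loss(W, z, text_emb):
    """Squared distance between the toy referee's output and the text embedding."""
    return float(np.sum((W @ z - text_emb) ** 2))

def invert_referee(W, text_emb, prior_mean, z_init, steps=500, lr=0.05,
                   fidelity_weight=0.1, n_aug=4, aug_std=0.05, seed=0):
    """Toy model inversion by gradient descent on a latent z.

    Assumptions (all stand-ins, not the paper's components):
      - W is a fixed linear 'referee' replacing the VLM.
      - Gaussian perturbations of z replace image augmentations.
      - A pull toward prior_mean replaces the SDS gradient.
    """
    rng = np.random.default_rng(seed)
    z = np.array(z_init, dtype=float)  # copy so the caller's array is untouched
    for _ in range(steps):
        # Alignment gradient, averaged over several augmented copies of z.
        grad_align = np.zeros_like(z)
        for _ in range(n_aug):
            z_aug = z + rng.normal(scale=aug_std, size=z.shape)
            residual = W @ z_aug - text_emb
            grad_align += 2.0 * (W.T @ residual) / n_aug
        # Fidelity gradient: gradient of 0.5 * ||z - prior_mean||^2.
        grad_fid = z - prior_mean
        # Balance the two terms with a single scalar weight.
        z = z - lr * (grad_align + fidelity_weight * grad_fid)
    return z

# Usage: a 5-dimensional latent scored by a 3-dimensional toy referee.
W_demo = np.hstack([np.eye(3), np.zeros((3, 2))])
text_demo = np.array([1.0, -2.0, 0.5])
z_demo = invert_referee(W_demo, text_demo,
                        prior_mean=np.zeros(5), z_init=np.zeros(5))
print("alignment loss before:", alignment_loss(W_demo, np.zeros(5), text_demo))
print("alignment loss after: ", alignment_loss(W_demo, z_demo, text_demo))
```

In this toy setting, raising `fidelity_weight` pulls the latent toward the prior at the cost of alignment, which loosely mirrors the balance between the two components that the abstract mentions. A real pipeline would replace each stand-in with a differentiable VLM score, image-space augmentations, and an SDS gradient from a pre-trained diffusion model.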
