Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba, Andrew D. Bagdanov

TL;DR

Problem: diffusion priors impose substantial training costs for text-to-image generation. Approach: replace priors with Optimization-based Visual Inversion (OVI), optimizing a latent CLIP image embedding to align with text, augmented by Mahalanobis and Nearest-Neighbor regularizers to keep results realistic. Findings: unconstrained OVI often matches TextEmb and reveals benchmark metric issues; constrained variants, especially Nearest-Neighbor, yield higher visual fidelity and competitive scores relative to trained priors like ECLIPSE. Significance: demonstrates a feasible training-free alternative for diffusion priors and motivates re-evaluation of T2I benchmarks to better reflect perceptual quality.

Abstract

Diffusion models have established the state of the art in text-to-image generation, but their performance often relies on a diffusion prior network that translates text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we question whether a trained prior is necessary at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize its cosine similarity with the embedding of the input text prompt. We further propose two novel constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, that regularize the OVI optimization toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks such as T2I-CompBench++: simply using the text embedding as a prior achieves surprisingly high scores despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline. The Nearest-Neighbor approach proves particularly effective, achieving quantitative scores comparable to or higher than those of the state-of-the-art data-efficient prior, which indicates that the idea merits further investigation. The code will be publicly available upon acceptance.
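The optimization loop described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It optimizes a single embedding vector directly rather than pseudo-tokens passed through a CLIP encoder, it uses NumPy with an analytic gradient instead of an autodiff framework, and the function names, step size, and unit-sphere projection are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad(v, t):
    """Gradient of cos(v, t) with respect to v."""
    nv, nt = np.linalg.norm(v), np.linalg.norm(t)
    cos = v @ t / (nv * nt)
    return t / (nv * nt) - cos * v / nv**2

def ovi_unconstrained(text_emb, steps=500, lr=0.1, seed=0):
    """Toy unconstrained OVI: gradient ascent on cosine similarity.

    Hypothetical simplification: `v` stands in for the optimized image
    embedding and is kept on the unit sphere after every step.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=text_emb.shape)   # random initialization
    v /= np.linalg.norm(v)
    history = []
    for _ in range(steps):
        v = v + lr * cosine_grad(v, text_emb)  # ascend the similarity
        v /= np.linalg.norm(v)                 # project back to unit norm
        history.append(cosine_similarity(v, text_emb))
    return v, history

if __name__ == "__main__":
    # Stand-in for a text embedding; a real one would come from a text encoder.
    text_emb = np.random.default_rng(1).normal(size=64)
    _, history = ovi_unconstrained(text_emb)
    print(f"similarity after first step: {history[0]:.3f}")
    print(f"similarity after last step:  {history[-1]:.3f}")
```

In this toy setting nothing stops the vector from collapsing onto the text embedding itself. That is the behaviour the abstract associates with the text-embedding baseline, and it is what the Mahalanobis and Nearest-Neighbor constraints are meant to counteract.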

Paper Structure

This paper contains 18 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Impact of negative embedding initialization. We show results for the prompt "Blue old car on a beach" (500 OVI steps, 1 token, 512x512). Left: Using a zero tensor results in a strong violet filter artifact and poor quality. Right: Using the ECLIPSE pipeline's default method corrects the color balance and improves overall definition.
  • Figure 2: Cosine similarity between the optimized image embedding and the target text embedding over 1000 OVI steps. As the number of pseudo-tokens increases, the optimization reaches a higher similarity plateau, moving closer to the text embedding.
  • Figure 3: Qualitative comparison of OVI variants, the TextEmb baseline, and ECLIPSE. Unconstrained OVI with more tokens produces results visually similar to TextEmb, while ECLIPSE yields higher fidelity.
  • Figure 4: Evolution of cosine similarity metrics during the optimization process for different OVI configurations. The solid blue curve tracks the similarity between the target text embedding and our optimized OVI embedding. The solid red curve measures the similarity between our OVI embedding and the embedding generated by the ECLIPSE prior. The dashed blue line represents the baseline similarity between the text embedding and the ECLIPSE image embedding. Left: Unconstrained OVI (6 tokens) converges to the text embedding ($>0.9$). Center: Mahalanobis constraint ($\lambda_M=0.009$) restricts text similarity to $\approx 0.83$. Right: Nearest-Neighbor constraint ($\lambda_N=0.5$) stabilizes text similarity around $0.88$ while reaching a similarity of $\approx 0.79$ with ECLIPSE, indicating strong convergence towards the trained prior's manifold.
  • Figure 5: Mode collapse with high Mahalanobis constraints. Regardless of the input prompt, setting $\lambda_M \ge 0.05$ forces the generated image to converge towards a generic "room" representation, likely the mean of the COCO embedding space.
  • ...and 2 more figures
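The captions for Figures 4 and 5 refer to weights $\lambda_M$ and $\lambda_N$ on the Mahalanobis and Nearest-Neighbor constraints. This section does not give the exact loss definitions, so the sketch below uses assumed forms: a squared Mahalanobis distance to the mean of a reference bank of image embeddings, and one minus the largest cosine similarity to any bank entry. The function names, the covariance regularizer `eps`, and the random stand-in bank are hypothetical.

```python
import numpy as np

def fit_reference_stats(bank, eps=1e-6):
    """Mean and inverse covariance of a bank of reference embeddings (rows)."""
    mean = bank.mean(axis=0)
    cov = np.cov(bank, rowvar=False) + eps * np.eye(bank.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis_penalty(v, mean, cov_inv):
    """Assumed form: squared Mahalanobis distance from v to the bank mean."""
    d = v - mean
    return float(d @ cov_inv @ d)

def nearest_neighbor_penalty(v, bank):
    """Assumed form: 1 - max cosine similarity between v and any bank row."""
    v_unit = v / np.linalg.norm(v)
    bank_unit = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return float(1.0 - np.max(bank_unit @ v_unit))

def constrained_loss(v, text_emb, bank, mean, cov_inv, lam_m=0.0, lam_n=0.0):
    """Negative text similarity plus weighted penalties (illustrative only)."""
    cos = v @ text_emb / (np.linalg.norm(v) * np.linalg.norm(text_emb))
    loss = -cos
    loss += lam_m * mahalanobis_penalty(v, mean, cov_inv)
    loss += lam_n * nearest_neighbor_penalty(v, bank)
    return float(loss)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(200, 16))   # stand-in for real image embeddings
    text_emb = rng.normal(size=16)      # stand-in for a text embedding
    mean, cov_inv = fit_reference_stats(bank)
    v = rng.normal(size=16)
    print("unconstrained:", constrained_loss(v, text_emb, bank, mean, cov_inv))
    print("Mahalanobis:  ", constrained_loss(v, text_emb, bank, mean, cov_inv, lam_m=0.009))
    print("NN:           ", constrained_loss(v, text_emb, bank, mean, cov_inv, lam_n=0.5))
```

The weights 0.009 and 0.5 are taken from the Figure 4 caption only as placeholders, since their scale depends on the paper's actual loss definitions. Under the assumed Mahalanobis form, a large `lam_m` makes the penalty dominate, and its minimizer is the bank mean regardless of the prompt, which is consistent with the collapse described in the Figure 5 caption.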