Gradient-Free Textual Inversion

Zhengcong Fei, Mingyuan Fan, Junshi Huang

TL;DR

This work addresses personalized text-to-image generation under inference-only constraints, where gradients are unavailable. It introduces a gradient-free textual inversion framework that optimizes a pseudo-token embedding via CMA-ES in a low-dimensional subspace, parameterized as e = e_0 + W_p Q, with a robust initialization e_0 obtained from cross-attention. Key contributions include adaptive initialization, subspace decomposition (PCA and prior normalization), and results competitive with gradient-based methods on both reconstruction and editability, along with qualitative demonstrations in text-guided synthesis and style transfer. The approach enables efficient, device-friendly personalization, with potentially safer and more scalable deployment in real-world settings.

Abstract

Recent work on personalized text-to-image generation usually learns to bind a special token to the specific subjects or styles of a few given images by tuning its embedding through gradient descent. It is natural to ask whether textual inversions can be optimized using only model inference, since requiring only forward computation retains the benefits of lower GPU memory use, simple deployment, and secure access for scalable models. In this paper, we introduce a gradient-free framework that optimizes the continuous textual inversion with an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion using both visual and text-vocabulary information. We then decompose the evolutionary optimization into dimension reduction of the search space and non-convex gradient-free optimization in the resulting subspace, which significantly accelerates optimization with negligible performance loss. Experiments in several applications demonstrate that a text-to-image model equipped with the proposed gradient-free method performs comparably to its gradient-based counterparts, while supporting a variety of GPU/CPU platforms, flexible deployment, and computational efficiency.
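The decomposition described in the abstract can be illustrated with a small forward-only search. The sketch below is not the paper's implementation: it replaces CMA-ES with a much simpler elite-averaging evolution strategy (fixed isotropic covariance, geometric step-size decay), uses a random orthonormal matrix for `W_p`, and stands in a toy quadratic `loss_fn` for the text-to-image model's loss. The names `subspace_es`, `loss_fn`, `q`, and all hyperparameters are assumptions made for illustration; the only idea taken from the source is searching a low-dimensional vector and mapping it back through e = e_0 + W_p q.

```python
import numpy as np

def subspace_es(loss_fn, e0, W_p, iters=200, popsize=20, elite=5, sigma=0.5, seed=0):
    """Minimize loss_fn(e0 + W_p @ q) over q using forward evaluations only.

    Simplified evolution strategy (NOT full CMA-ES): sample around a mean,
    keep the best `elite` candidates, and move the mean to their average.
    """
    rng = np.random.default_rng(seed)
    d = W_p.shape[1]                      # subspace dimensionality
    mean = np.zeros(d)                    # start at the initialization e0
    best_q, best_loss = mean.copy(), loss_fn(e0 + W_p @ mean)
    for _ in range(iters):
        # Sample a population of low-dimensional candidates.
        candidates = mean + sigma * rng.standard_normal((popsize, d))
        # Evaluate each candidate with a forward pass only (no gradients).
        losses = np.array([loss_fn(e0 + W_p @ q) for q in candidates])
        order = np.argsort(losses)
        if losses[order[0]] < best_loss:
            best_loss = float(losses[order[0]])
            best_q = candidates[order[0]].copy()
        # Recombine: the new mean is the average of the elite candidates.
        mean = candidates[order[:elite]].mean(axis=0)
        sigma *= 0.98                     # crude step-size decay (assumption)
    return e0 + W_p @ best_q, best_loss

# Toy setup: a 64-dim embedding searched in an 8-dim subspace.
rng = np.random.default_rng(1)
D, d = 64, 8
W_p, _ = np.linalg.qr(rng.standard_normal((D, d)))   # orthonormal columns, shape (D, d)
e0 = rng.standard_normal(D)                          # hypothetical initial embedding
target = e0 + W_p @ rng.standard_normal(d)           # hypothetical optimum inside the subspace

def loss_fn(e):
    """Toy stand-in for the model's reconstruction loss."""
    return float(np.sum((e - target) ** 2))

initial_loss = loss_fn(e0)
e_best, final_loss = subspace_es(loss_fn, e0, W_p)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Each iteration costs `popsize` forward evaluations and stores no computation graph, which is where the memory and deployment benefits claimed in the abstract would come from. A full CMA-ES would additionally adapt the sampling covariance and step size from the search history.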

Paper Structure (27 sections, 5 equations, 6 figures)

This paper contains 27 sections, 5 equations, 6 figures.

Figures (6)

  • Figure 1: Overview of the gradient-free textual inversion framework for personalized text-to-image generation. The evolution strategy is performed iteratively to explore and exploit the pseudo-token embedding. To accelerate the optimization, (i) the textual inversion is initialized with weighted cross-attention between the given images and the vocabulary, and (ii) optimization is conducted in a decomposed subspace obtained through PCA or prior normalization.
  • Figure 2: Personalized text-guided image generation. With the gradient-free optimization method, the pseudo-token for the learned concept can be used to create personalized images as if it were a normal word token. Importantly, our method performs on par with, and sometimes better than, standard textual inversion on this task.
  • Figure 3: Personalized style-guided image generation. Because the textual-embedding space can represent more abstract concepts, including different art styles, a pseudo-token embedding learned with gradient-free optimization can also represent the style of the given images effectively.
  • Figure 4: Quantitative analysis with CLIP-based evaluations, comparing standard textual inversion with gradient-free inversion variants that differ in the number of pseudo-tokens and the dimensionality of the decomposition subspace.
  • Figure 5: Effect of general condition initialization. The cross-attention method provides a good initialization point for the special pseudo-token embedding and yields markedly faster convergence.
  • ...and 1 more figures
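
The two acceleration steps named in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes the image features and vocabulary embeddings already live in a shared space of the same dimensionality, uses a plain scaled dot-product softmax as the attention weighting, and builds the PCA basis from the vocabulary embeddings. The function names `cross_attention_init` and `pca_subspace`, the `temperature` parameter, and the random stand-in data are all hypothetical.

```python
import numpy as np

def cross_attention_init(image_feats, vocab_emb, temperature=1.0):
    """Return an initial embedding e0 as an attention-weighted average of
    vocabulary embeddings, with the mean image feature acting as the query."""
    query = image_feats.mean(axis=0)                                   # (D,)
    scores = vocab_emb @ query / (np.sqrt(query.size) * temperature)   # (V,)
    scores = scores - scores.max()          # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()       # softmax over the vocabulary
    return weights @ vocab_emb, weights     # e0 has shape (D,)

def pca_subspace(vocab_emb, d):
    """Return a (D, d) projection whose columns are the top-d principal
    directions of the centered vocabulary embeddings."""
    centered = vocab_emb - vocab_emb.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:d].T

# Random stand-ins for real encoder outputs (assumption: shared 32-dim space).
rng = np.random.default_rng(0)
vocab_emb = rng.standard_normal((500, 32))   # hypothetical vocabulary embeddings
image_feats = rng.standard_normal((4, 32))   # hypothetical features of 4 concept images

e0, weights = cross_attention_init(image_feats, vocab_emb)
W_p = pca_subspace(vocab_emb, d=8)
print(e0.shape, W_p.shape)
```

In this reading, `e0` supplies the starting point and `W_p` the search directions for e = e_0 + W_p Q, so the evolutionary search only has to explore 8 coordinates instead of 32 in this toy setting.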