Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai

TL;DR

This work tackles the challenge of editing images generated by text-to-image diffusion models without relying on manual prompts or retraining the diffusion model. It introduces classifier-guided semantic optimization (CASO), which learns a set of semantic embeddings $\{e_a\}$ for target attributes by optimizing an editing loss against a fixed attribute classifier, tying embeddings to attribute class means through neural-collapse theory. The approach enables disentangled, dataset-level edits across diverse domains, with a reconstruction term to preserve non-target details, and demonstrates strong generalization, multi-attribute editing, and reconstruction quality improvements. Practically, CASO provides a lightweight, training-efficient pathway to precise, controllable edits in diffusion-based generation, while highlighting ethical considerations around potential misuse and advocating responsible deployment.

Abstract

Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data.
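
To make the idea of classifier-guided embedding optimization concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a hypothetical two-dimensional "image" space, a stand-in editor `edit(x, e) = x + e` in place of the frozen diffusion model, a fixed logistic classifier with made-up weights `W` and bias `B`, and a squared-norm reconstruction penalty with a hypothetical weight `recon_weight`. Only the embedding `e` is trained; the classifier and the editor stay fixed.

```python
from math import exp

# Hypothetical fixed attribute classifier: only the first coordinate
# carries the target attribute. W and B are illustrative values.
W = [1.0, 0.0]
B = -1.0


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def attribute_prob(x):
    """Probability that sample x shows the target attribute."""
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + B)


def edit(x, e):
    """Stand-in for the frozen generator: shift the sample by the embedding."""
    return [xi + ei for xi, ei in zip(x, e)]


def learn_embedding(dataset, steps=200, lr=0.5, recon_weight=0.05):
    """Gradient descent on a dataset-level loss over the embedding only.

    Per-sample loss: -log p(attribute | edit(x, e)) + recon_weight * ||e||^2.
    The second term penalizes the change edit(x, e) - x, which equals e here.
    """
    e = [0.0] * len(W)
    for _ in range(steps):
        grad = [0.0] * len(e)
        for x in dataset:
            p = attribute_prob(edit(x, e))
            for i in range(len(e)):
                # d/de_i of -log(sigmoid(z)) is -(1 - p) * W[i]
                grad[i] += -(1.0 - p) * W[i] + 2.0 * recon_weight * e[i]
        e = [ei - lr * g / len(dataset) for ei, g in zip(e, grad)]
    return e


dataset = [[0.0, 1.0], [-1.0, 2.0], [0.5, -1.0]]
embedding = learn_embedding(dataset)
before = [attribute_prob(x) for x in dataset]
after = [attribute_prob(edit(x, embedding)) for x in dataset]
```

In this toy setting the learned embedding moves only along the coordinate the classifier is sensitive to, which illustrates, in a very simplified form, why a classifier-driven objective with a reconstruction term can leave non-target directions untouched.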

Paper Structure

This paper contains 30 sections, 3 theorems, 27 equations, 17 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

(Neural collapse) For a sufficiently large classifier network, the last-layer weight vector $w_a$ of the classifier converges to the globally-centered class mean $\mu_a$ for each class $a = 1, 2, \ldots, K$.
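
The quantities in Theorem 1 can be illustrated with a small sketch. It assumes the standard neural-collapse definition of the globally-centered class mean (the mean feature of class $a$ minus the mean feature over all samples), since the text of Definition 1 is not reproduced on this page. The feature vectors and the weight vector `w_0` below are made-up values, and cosine similarity is used as one plausible way to check alignment between $w_a$ and $\mu_a$.

```python
from math import sqrt


def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def globally_centered_class_means(features_by_class):
    """Class mean minus the global feature mean, for every class."""
    all_features = [f for feats in features_by_class.values() for f in feats]
    global_mean = mean(all_features)
    return {
        a: [m - g for m, g in zip(mean(feats), global_mean)]
        for a, feats in features_by_class.items()
    }


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


# Hypothetical penultimate-layer features for K = 2 attribute classes.
features_by_class = {
    0: [[1.0, 0.0], [3.0, 0.0]],
    1: [[0.0, 2.0], [0.0, 4.0]],
}
mu = globally_centered_class_means(features_by_class)

# Hypothetical last-layer weight for class 0, chosen parallel to mu[0].
w_0 = [2.0, -3.0]
alignment = cosine(w_0, mu[0])
```

An alignment close to 1 would indicate that the classifier weight points in the same direction as the centered class mean, which is the relationship the theorem describes in the limit.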

Figures (17)

  • Figure 1: ClAssifier-guided Semantic Optimization (CASO). The trainable continuous semantic embedding for the target attribute $a$ guides Stable Diffusion toward the desired edits.
  • Figure 2: CASO Edit Result. Our method generalizes well to data with different styles.
  • Figure 3: Comparison of different methods for attribute "Mustache". Our method shows the best generalization because it captures the exact semantics at the dataset level.
  • Figure 4: CASO Interpolation Results. Our method allows users to perform fine-grained and bidirectional editing by simply changing the classifier-free guidance scale.
  • Figure 5: CASO Multi-attribute Edit. In complex and challenging scenarios, our method can still achieve perfect editing.
  • ...and 12 more figures
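
The guidance-scale control mentioned in the Figure 4 caption can be sketched with the standard classifier-free guidance combination, in which the guided noise estimate is the unconditional estimate plus a scaled difference between the conditional and unconditional estimates. How CASO feeds its learned embedding into the conditional branch is not described on this page, so the function below is only a generic illustration; the name `guided_noise` and the toy noise vectors are hypothetical.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 ignores the condition, scale = 1 uses the conditional estimate
    as is, larger values strengthen the edit, and negative values push the
    estimate in the opposite direction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise estimates standing in for the denoiser outputs at one step.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_edit = guided_noise(eps_uncond, eps_cond, 0.0)
plain_edit = guided_noise(eps_uncond, eps_cond, 1.0)
strong_edit = guided_noise(eps_uncond, eps_cond, 2.0)
reverse_edit = guided_noise(eps_uncond, eps_cond, -1.0)
```

Sweeping `scale` continuously gives intermediate edit strengths, and flipping its sign gives an edit in the opposite direction, which is one way to read the "fine-grained" and "bidirectional" behavior described in the caption.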

Theorems & Definitions (7)

  • Definition 1: Globally-centered attribute class mean $\mu_a$
  • Theorem 1
  • Proposition 2
  • Remark 1
  • Definition 2
  • Theorem 3: Neural collapse (Papyan et al., 2020)
  • Proof