Scaling Concept With Text-Guided Diffusion Models

Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu

TL;DR

This paper introduces ScalingConcept, a simple yet effective method that scales decomposed concepts up or down in real input without introducing new elements. It enables a variety of novel zero-shot applications across image and audio domains, including canonical pose generation and generative sound highlighting or removal.

Abstract

Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog to a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real input without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, where concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.

Paper Structure

This paper contains 20 sections, 7 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Applications of ScalingConcept. We showcase various zero-shot applications across image and audio modalities, highlighting the surprising effects of scaling concepts up or down, including non-trivial tasks such as canonical pose generation and sound modulation, among others.
  • Figure 1: Comparison of different methods for concept enhancement. Our method, ScalingConcept, achieves the best performance in terms of image quality (lower FID score) and preservation of the original content (lower LPIPS), with concept enhancement comparable to other approaches (similar CLIP score).
  • Figure 2: (a) Illustration of the concept removal capability observed in the sampling process of text-guided diffusion models when conditioning on a prompt that is conceptually different from the one used in the inversion process. (b) We compute CLIP zero-shot classification results between the classes ["a sky", "a church"] and the reconstruction results at each inversion/sampling step (the total number of sampling steps is 50), and report the classification accuracy for the class "a church". The church object is removed in the removal branch even at the very early stages of sampling.
  • Figure 3: Analysis of the trend of concept removal. We erase target concepts from given images and audio clips using the proposed inversion and sampling process. We report the number of samples with target concepts before and after concept removal.
  • Figure 4: Overview of the ScalingConcept framework. Our method consists of two steps: 1) extracting the latent variable from $\boldsymbol{x}_0$, and 2) constructing different sampling branches and modeling the difference between them.
  • ...and 12 more figures
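
The Figure 4 caption describes building separate sampling branches and modeling the difference between them. A minimal sketch of that idea follows, assuming the two branches each yield a noise prediction at a given step and that the difference is applied linearly with a single scalar. The function name `scale_concept`, the parameter `omega`, and the linear form are illustrative assumptions, not the paper's exact equation.

```python
from typing import List, Sequence


def scale_concept(
    eps_recon: Sequence[float],
    eps_removal: Sequence[float],
    omega: float,
) -> List[float]:
    """Combine two branch predictions by scaling their difference.

    eps_recon   -- prediction from the branch that keeps the concept (assumed)
    eps_removal -- prediction from the branch that drops the concept (assumed)
    omega       -- hypothetical scaling factor:
                   omega > 0 pushes further toward the concept,
                   omega = 0 returns the reconstruction branch unchanged,
                   omega = -1 returns the removal branch.
    """
    if len(eps_recon) != len(eps_removal):
        raise ValueError("branch predictions must have the same length")
    # The concept is represented here as the elementwise difference
    # between the two branches; omega scales that difference.
    return [r + omega * (r - m) for r, m in zip(eps_recon, eps_removal)]


if __name__ == "__main__":
    recon = [1.0, 2.0]
    removal = [0.5, 1.0]
    for omega in (-1.0, 0.0, 1.0):
        print(omega, scale_concept(recon, removal, omega))
```

In this sketch a single scalar controls both directions: positive values amplify whatever distinguishes the two branches, and negative values suppress it. A real sampler would apply such a combination at every denoising step, possibly with a step-dependent factor.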