Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing

Hangeol Chang, Jinho Chang, Jong Chul Ye

TL;DR

Ground-A-Score is a simple yet powerful model-agnostic image editing method that incorporates grounding during score distillation. By taking into account prior knowledge of object locations within the image, it ensures that intricate prompt requirements are precisely reflected in the editing outcomes.

Abstract

Despite recent advancements in text-to-image diffusion models facilitating various image editing techniques, complex text prompts often lead to an oversight of some requests due to a bottleneck in processing text information. To tackle this challenge, we present Ground-A-Score, a simple yet powerful model-agnostic image editing method that incorporates grounding during score distillation. This approach ensures a precise reflection of intricate prompt requirements in the editing outcomes, taking into account the prior knowledge of the object locations within the image. Moreover, the selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas while preserving the integrity of the objects in the source image. Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts, ensuring high-quality outcomes that respect the original image attributes.

Paper Structure (24 sections, 9 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The overview of the proposed pipeline for image editing with complex user requests. (a) We leverage the prior knowledge from the multimodal LLM and the zero-shot grounding model to break down the user request into multiple image editing subtasks, each targeting a single entity. (b) A pre-trained text-to-image diffusion model is used for each subtask to obtain a corresponding gradient for the source image. These gradients are masked and aggregated to get a total gradient that is efficient and stable.
  • Figure 1: The effect of changing hyperparameter $\eta$ for the null-text penalty. (a) The plot of the null-text penalty coefficient $\gamma$ against the difference between $\hat{\epsilon}^{\omega}_{\phi} (z_t,t,y_k)$ and $\hat{\epsilon}_{\phi} (z_t,t,\varnothing)$ with different $\eta$. The dotted line of $\gamma=1$ represents the scenario without a null-text penalty. (b) Ground-A-Score image editing outputs using three different $\eta$ with a shared source image and editing prompt. The images were optimized with the same number of optimization steps and the same optimization step size.
  • Figure 2: The difference between the predicted noise on the target image, with the given condition and null text, in two image editing scenarios. The source image and the output from DDS are also provided. The red boxes indicate the region corresponding to the object meant to be edited.
  • Figure 2: The image editing results from Ground-A-Score and DDS, with synthetic image editing scenarios and editing prompts.
  • Figure 3: The benchmark result of Ground-A-Score with other baseline models using the same editing prompts.
  • ...and 1 more figures
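
The masking and aggregation step described in the pipeline overview caption (one gradient per editing subtask, masked and combined into a total gradient) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `aggregate_masked_gradients`, the array shapes, and the choice to average gradients where masks overlap are all assumptions made here, since the caption does not state how the masked gradients are combined.

```python
import numpy as np


def aggregate_masked_gradients(grads, masks):
    """Combine per-subtask gradients using their spatial masks.

    grads: list of arrays shaped (C, H, W), one per editing subtask.
    masks: list of arrays shaped (H, W) with values in [0, 1], e.g.
           derived from grounding boxes (hypothetical input format).
    Assumption: where masks overlap, gradients are averaged by mask weight.
    """
    total = np.zeros(grads[0].shape, dtype=np.float64)
    weight = np.zeros(grads[0].shape[-2:], dtype=np.float64)
    for g, m in zip(grads, masks):
        # Broadcast the (H, W) mask over the channel axis.
        total += g * m[None, :, :]
        weight += m
    # Pixels outside every mask keep a zero gradient, so they are not edited.
    # Dividing by max(weight, 1) only rescales regions covered more than once.
    return total / np.maximum(weight, 1.0)[None, :, :]


# Two toy subtasks on a 1-channel 2x2 latent.
grads = [np.ones((1, 2, 2)), 3.0 * np.ones((1, 2, 2))]
masks = [np.array([[1.0, 1.0], [0.0, 0.0]]),   # subtask 1 covers the top row
         np.array([[0.0, 1.0], [0.0, 1.0]])]   # subtask 2 covers the right column
total_grad = aggregate_masked_gradients(grads, masks)
```

In this toy case the top-right pixel is covered by both masks and receives the average of the two gradients, while the bottom-left pixel is covered by neither and receives zero.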