Table of Contents
Fetching ...

Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner

Xing Cui, Peipei Li, Zekun Li, Xuannan Liu, Yueying Zou, Zhaofeng He

TL;DR

Qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods, which addresses"how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance.

Abstract

Flexible and accurate drag-based editing is a challenging task that has recently garnered significant attention. Current methods typically model this problem as automatically learning "how to drag" through point dragging and often produce one deterministic estimation, which presents two key limitations: 1) Overlooking the inherently ill-posed nature of drag-based editing, where multiple results may correspond to a given input, as illustrated in Fig.1; 2) Ignoring the constraint of image quality, which may lead to unexpected distortion. To alleviate this, we propose LucidDrag, which shifts the focus from "how to drag" to "what-then-how" paradigm. LucidDrag comprises an intention reasoner and a collaborative guidance sampling mechanism. The former infers several optimal editing strategies, identifying what content and what semantic direction to be edited. Based on the former, the latter addresses "how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance. Specifically, semantic guidance is derived by establishing a semantic editing direction based on reasoned intentions, while quality guidance is achieved through classifier guidance using an image fidelity discriminator. Both qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods.

Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner

TL;DR

Qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods, which addresses"how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance.

Abstract

Flexible and accurate drag-based editing is a challenging task that has recently garnered significant attention. Current methods typically model this problem as automatically learning "how to drag" through point dragging and often produce one deterministic estimation, which presents two key limitations: 1) Overlooking the inherently ill-posed nature of drag-based editing, where multiple results may correspond to a given input, as illustrated in Fig.1; 2) Ignoring the constraint of image quality, which may lead to unexpected distortion. To alleviate this, we propose LucidDrag, which shifts the focus from "how to drag" to "what-then-how" paradigm. LucidDrag comprises an intention reasoner and a collaborative guidance sampling mechanism. The former infers several optimal editing strategies, identifying what content and what semantic direction to be edited. Based on the former, the latter addresses "how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance. Specifically, semantic guidance is derived by establishing a semantic editing direction based on reasoned intentions, while quality guidance is achieved through classifier guidance using an image fidelity discriminator. Both qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods.
Paper Structure (42 sections, 16 equations, 11 figures, 8 tables, 1 algorithm)

This paper contains 42 sections, 16 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Given an input image, the user draws a mask specifying the editable region and clicks dragging points (handle points (red) and target points (blue)). Our LucidDrag considers the ill-posed nature of drag-based editing and can produce diverse results (the first row). Besides, it achieves outstanding performance in editing accuracy and image fidelity (the second row).
  • Figure 2: Overview of LucidDrag. LucidDrag comprises two main components: an intention reasoner and a collaborative guidance sampling mechanism. Intention Reasoner leverages an LVLM and an LLM to reason $N$ possible semantic intentions. Collaborative Guidance Sampling facilitates semantic-aware editing by collaborating editing guidance with semantic guidance and quality guidance.
  • Figure 3: LucidDrag allows generating diverse results conforming to the intention.
  • Figure 4: Qualitative comparison between our LucidDrag and other methods in object moving.
  • Figure 5: More qualitative comparison on content dragging.
  • ...and 6 more figures