PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal

Dinh-Khoi Vo, Van-Loc Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Abstract

Removing objects from natural images is challenging due to the difficulty of synthesizing semantically coherent content while preserving background integrity. Existing methods often rely on fine-tuning, prompt engineering, or inference-time optimization, yet still suffer from texture inconsistency, rigid artifacts, weak foreground-background disentanglement, and poor scalability for multi-object removal. We propose a novel zero-shot object removal framework, namely PANDORA, that operates directly on pre-trained text-to-image diffusion models, requiring no fine-tuning, prompts, or optimization. We propose Pixel-wise Attention Dissolution to remove objects by nullifying the most correlated attention keys for masked pixels, effectively eliminating the object from the self-attention flow and allowing background context to dominate reconstruction. We further introduce Localized Attentional Disentanglement Guidance to steer denoising toward latent manifolds favorable to clean object removal. Together, these components enable precise, non-rigid, prompt-free, and scalable multi-object erasure in a single pass. Experiments demonstrate superior visual fidelity and semantic plausibility compared to state-of-the-art methods. The project page is available at https://vdkhoi20.github.io/PANDORA.
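The core mechanism described above, suppressing a masked pixel's most correlated self-attention keys so background context dominates, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function name, `top_k` parameter, and single-head numpy formulation are all assumptions.

```python
import numpy as np

def dissolve_attention(Q, K, object_mask, top_k=3):
    """Illustrative sketch of Pixel-wise Attention Dissolution (PAD).

    For each query pixel inside the object mask, the top_k most
    correlated attention keys are nullified (set to -inf before the
    softmax), so attention mass flows to background context instead.
    Q, K: (N, d) flattened query/key matrices for one attention head.
    object_mask: (N,) boolean map of masked (object) pixels.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (N, N) attention logits
    for i in np.flatnonzero(object_mask):      # queries on the object
        top = np.argsort(scores[i])[-top_k:]   # most correlated keys
        scores[i, top] = -np.inf               # dissolve them
    # standard numerically stable softmax over keys
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

In an actual diffusion U-Net this masking would be applied per head at each self-attention layer, with the mask downsampled to the layer's spatial resolution.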

Paper Structure

This paper contains 16 sections, 7 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Our PANDORA enables prompt-free, fine-tuning-free object removal across various scenarios in a single forward pass. Without requiring training or textual prompts, our approach handles diverse and challenging removal settings, from a single object to multiple similar or distinct targets and even densely packed similar objects, while preserving background fidelity and structural consistency.
  • Figure 2: Overview of our proposed pipeline with an intuitive illustration of each module. The image is inverted into noise with intermediate latents stored and injected into BPA and PAD to preserve background and dissolve objects, respectively. Specifically, BPA restricts background queries to background regions, while PAD operates at the pixel level to constrain object queries to unrelated regions. Finally, LADG steers denoising away from masked object regions for seamless synthesis.
  • Figure 3: Qualitative comparison across diverse object removal scenarios, including single-object, multi-object, and mass similar-object removal (top to bottom). From left to right: the original image with mask and results from different methods. The last five columns correspond to zero-shot approaches.
  • Figure 4: From left to right, the mask is $20\%$ smaller, $10\%$ smaller, tightly aligned, $10\%$ larger, and $20\%$ larger than the object.
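The Background Preservation Attention (BPA) constraint described in the Figure 2 caption, restricting background queries so they attend only to background regions, amounts to building a query-by-key allow mask before the softmax. The sketch below is a hedged illustration under that reading; the function name and boolean-mask representation are assumptions, not the paper's API.

```python
import numpy as np

def build_bpa_mask(object_mask):
    """Illustrative sketch of the BPA attention constraint.

    Background queries (rows) are blocked from attending to object
    keys (columns), keeping background reconstruction grounded in
    background evidence. Object queries are left unconstrained here,
    since PAD handles them separately at the pixel level.
    object_mask: (N,) boolean map of object pixels (flattened).
    Returns an (N, N) boolean allow-mask; disallowed entries would be
    set to -inf in the attention logits before the softmax.
    """
    N = object_mask.shape[0]
    allow = np.ones((N, N), dtype=bool)
    bg = ~object_mask
    # background queries may not attend to object keys
    allow[np.ix_(bg, object_mask)] = False
    return allow
```

Applied per self-attention layer (with the mask resized to that layer's resolution), this complements PAD: BPA protects the background, while PAD dissolves the object.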