AbracADDbra: Touch-Guided Object Addition by Decoupling Placement and Editing Subtasks

Kunal Swami, Raghu Chittersu, Yuvraj Rathore, Rajeev Irny, Shashavali Doodekula, Alok Shukla

TL;DR

This work introduces AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement, and reveals a strong correlation between initial placement accuracy and final edit quality, validating the decoupled approach.

Abstract

Instruction-based object addition is often hindered by the ambiguity of text-only prompts or the tedious nature of mask-based inputs. To address this usability gap, we introduce AbracADDbra, a user-friendly framework that leverages intuitive touch priors to spatially ground succinct instructions for precise placement. Our efficient, decoupled architecture uses a vision-language transformer for touch-guided placement, followed by a diffusion model that jointly generates the object and an instance mask for high-fidelity blending. To facilitate standardized evaluation, we contribute the Touch2Add benchmark for this interactive task. Our extensive evaluations, where our placement model significantly outperforms both random placement and general-purpose VLM baselines, confirm the framework's ability to produce high-fidelity edits. Furthermore, our analysis reveals a strong correlation between initial placement accuracy and final edit quality, validating our decoupled approach. This work thus paves the way for more accessible and efficient creative tools.

Paper Structure (17 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Main idea of AbracADDbra. We propose a new framework for adding objects to images that lets users provide simple touch input along with instructions, making editing more accurate and user-friendly.
  • Figure 2: AbracADDbra performs high-fidelity object addition via touch and succinct instructions. Our method combines an intuitive touch prior with a simple, succinct prompt (in green) to achieve precise object addition. For a fair comparison against strong baselines, we provided them with detailed prompts (in blue).
  • Figure 3: Our automated data generation pipeline. The process includes four stages: (1) Preprocessing: Filtering COCO objects based on size, boundary proximity, and CLIP score. (2) Inpainting: A two-stage process using LaMa [lama_inpainting_wacv2022] and Stable Diffusion [stablediffusion_cvpr2022]. (3) Postprocessing: Applying filtering steps from [pbyi_cvpr2025]. (4) Caption Generation: Using GLaMM [glamm_cvpr2024] and an LLM to create the instruction and placement reasoning.
  • Figure 4: The detailed architecture of our method. The inference scenario is shown; the VAE encoder and decoder are omitted.
  • Figure 5: Diversity statistics of our Touch2Add dataset. We also compare the average scene complexity with the MagicBrush object addition subset.
  • ...and 2 more figures
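
The preprocessing stage named in the Figure 3 caption (filtering COCO objects by size, boundary proximity, and CLIP score) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `keep_candidate` function, and every threshold value below are hypothetical placeholders chosen for illustration, since the caption does not state the actual criteria.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Candidate:
    """A candidate object (hypothetical type, not from the paper)."""
    box: Tuple[float, float, float, float]  # COCO-style (x, y, width, height) in pixels
    clip_score: float                       # precomputed image-text similarity


def keep_candidate(
    c: Candidate,
    img_w: int,
    img_h: int,
    min_area_frac: float = 0.02,  # assumed threshold
    max_area_frac: float = 0.50,  # assumed threshold
    margin: int = 8,              # assumed boundary margin in pixels
    min_clip: float = 0.25,       # assumed threshold
) -> bool:
    """Return True if the object passes the size, boundary, and CLIP filters."""
    x, y, w, h = c.box

    # Size filter: reject objects that are too small or too large
    # relative to the image area.
    area_frac = (w * h) / (img_w * img_h)
    if not (min_area_frac <= area_frac <= max_area_frac):
        return False

    # Boundary-proximity filter: reject boxes within `margin` pixels
    # of any image edge.
    if x < margin or y < margin:
        return False
    if x + w > img_w - margin or y + h > img_h - margin:
        return False

    # CLIP-score filter: reject objects with a low similarity score.
    return c.clip_score >= min_clip
```

Under these assumed thresholds, a well-sized, centered object with a sufficient CLIP score is kept, while an object touching the image edge, a tiny object, or a low-scoring object is rejected.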