Point and Instruct: Enabling Precise Image Editing by Unifying Direct Manipulation and Text Instructions
Alec Helbling, Seongmin Lee, Polo Chau
TL;DR
The paper addresses the challenge of making precise image edits when text prompts alone cannot disambiguate the target object or specify an exact location. It introduces Point and Instruct, a web-based multimodal interface that unifies direct manipulation (bounding boxes, points) with natural-language instructions and passes both to an LLM, which produces a transformed image layout. The transformed layout then guides a layout-based diffusion pipeline (GLIGEN and related tools) to edit the image, with optional subject-specific fine-tuning for consistency. The work demonstrates improved precision over text-only editing baselines and outlines ongoing user studies and evaluation plans to quantify usability and editing accuracy.
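To make the idea of a "transformed image layout" concrete, here is a minimal Python sketch of one possible layout representation and a single edit ("move this box to that point"). It is not the authors' implementation: the `Box` class, the `move_box_to_point` helper, the identifiers `box_1` and `box_2`, and the normalized coordinate convention are all hypothetical. In the described system the new layout is produced by an LLM from the user's instruction; the function below only illustrates what such an output could look like for a simple move.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in normalized [0, 1] image coordinates (assumed convention)."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


Point = Tuple[float, float]


def move_box_to_point(layout: Dict[str, Box], box_id: str, target: Point) -> Dict[str, Box]:
    """Return a new layout in which `box_id` is re-centered on `target`.

    The box is clamped so that it stays inside the image. The input layout
    is left unmodified.
    """
    box = layout[box_id]
    tx, ty = target
    new_x = min(max(tx - box.w / 2, 0.0), 1.0 - box.w)
    new_y = min(max(ty - box.h / 2, 0.0), 1.0 - box.h)
    updated = dict(layout)
    updated[box_id] = replace(box, x=new_x, y=new_y)
    return updated


# Two dogs share the same label; the box identifier, not the text, picks the target.
layout = {
    "box_1": Box("dog", 0.10, 0.50, 0.20, 0.30),
    "box_2": Box("dog", 0.60, 0.50, 0.20, 0.30),
}

# Hypothetical instruction: "move [box_2] to [point_1]", with point_1 = (0.5, 0.25).
new_layout = move_box_to_point(layout, "box_2", (0.5, 0.25))
```

Because both boxes carry the label "dog", a text-only prompt would have to describe which dog is meant, whereas the visual reference resolves the ambiguity directly. The resulting layout is the kind of structure a layout-conditioned generator could then consume.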
Abstract
Machine learning has enabled the development of powerful systems capable of editing images from natural language instructions. However, in many common scenarios it is difficult for users to specify precise image transformations with text alone. For example, in an image with several dogs, it is difficult to select a particular dog and move it to a precise location. Doing this with text alone would require a complex prompt that disambiguates the target dog and describes the destination. In contrast, direct manipulation is well suited to visual tasks like selecting objects and specifying locations. We introduce Point and Instruct, a system for seamlessly combining familiar direct manipulation and textual instructions to enable precise image manipulation. With our system, a user can visually mark objects and locations, and reference them in textual instructions. This allows users to benefit from both the visual descriptiveness of natural language and the spatial precision of direct manipulation.
