SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Yicheng Xiao, Wenhu Zhang, Lin Song, Yukang Chen, Wenbo Li, Nan Jiang, Tianhe Ren, Haokun Lin, Wei Huang, Haoyang Huang, Xiu Li, Nan Duan, Xiaojuan Qi

Abstract

Image spatial editing applies geometry-driven transformations to an image, allowing precise control over object layout and camera viewpoint. Current editing models remain insufficient for fine-grained spatial manipulation, motivating a dedicated evaluation suite. Our contributions are threefold: (i) We introduce SpatialEdit-Bench, a comprehensive benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing that achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.
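
The camera-centric side of this data is parameterized by yaw, pitch, and distance (zoom). As a minimal sketch of what "systematic camera trajectories" with exact ground-truth transformations can look like, the snippet below enumerates poses on a yaw/pitch/distance grid around a target point and converts each to a look-at extrinsic. All function and parameter names are illustrative assumptions, not the released SpatialEdit pipeline.

```python
# Hypothetical sketch: sample camera poses on a yaw/pitch/distance grid and
# build world-to-camera look-at extrinsics. Rendering a scene from two poses
# on this grid yields an image pair whose relative camera motion is known
# exactly. Names are illustrative, not from the SpatialEdit codebase.
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Return a 3x4 world-to-camera [R|t] matrix looking from eye at target."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward])  # camera axes as matrix rows
    t = -R @ eye
    return np.hstack([R, t[:, None]])

def sample_trajectory(target, yaws_deg, pitches_deg, distances):
    """Enumerate look-at poses over a yaw/pitch/distance grid around target."""
    target = np.asarray(target, dtype=float)
    poses = []
    for d in distances:                        # zoom / dolly distance
        for pitch in np.radians(pitches_deg):  # elevation
            for yaw in np.radians(yaws_deg):   # azimuth
                eye = target + d * np.array([
                    np.cos(pitch) * np.cos(yaw),
                    np.cos(pitch) * np.sin(yaw),
                    np.sin(pitch),
                ])
                poses.append(look_at(eye, target))
    return poses

# e.g. a 15-degree yaw sweep at two elevations and two zoom levels
poses = sample_trajectory(target=(0, 0, 0),
                          yaws_deg=range(0, 360, 15),
                          pitches_deg=[15, 45],
                          distances=[2.0, 3.0])
```

Because every pose on such a grid is known analytically, the difference between any two rendered views is an exact ground-truth camera transformation, which is what makes viewpoint-reconstruction metrics of the kind described above possible.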


Figures (9)

  • Figure 1: Illustration for image spatial editing. It comprises two components: (1) camera-centric view manipulation, including pitch, yaw, and zoom transformations; and (2) single-object manipulation, encompassing object rotation while preserving the background, as well as translation and scaling of objects specified via user-defined bounding boxes (a toy sketch of this bounding-box specification follows this list).
  • Figure 2: Statistics of SpatialEdit-500k. (a) Distribution of camera-level data across seven sub-tasks in outdoor and indoor scenes, where Y, P, and D denote Yaw, Pitch, and Distance, respectively. (b) Aspect-ratio distribution of bounding boxes for the moving task at the object level. (c) Object category statistics across the entire dataset.
  • Figure 3: SpatialEdit-500k data generation pipeline. We leverage Blender to synthesize both objects and scenes, while preprocessing 3D assets using SAM3 and a vision-language model. The object-level engine constructs two inpainting-based data branches to generate object transformations, including rotation, translation, and scaling. The camera-level engine produces viewpoint transformation data by sampling different camera poses, resulting in variations in yaw, pitch, and zoom.
  • Figure 4: Overview of SpatialEdit.
  • Figure 5: Comparison of camera view manipulation across various methods.
  • ...and 4 more figures
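
Figure 1's object-level edits specify translation and scaling through a source/target bounding-box pair, and Figure 2(b) reports aspect-ratio statistics of exactly these boxes. The toy sketch below shows how such a pair pins down a ground-truth scale and translation for the moving task; the Box type and function name are hypothetical, not from the released code.

```python
# Toy sketch: recover the scale and translation implied by a user-defined
# source/target bounding-box pair, as used to supervise object moving edits.
# All names are illustrative assumptions, not the SpatialEdit API.
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # left edge (pixels)
    y: float  # top edge (pixels)
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)

def box_transform(src: Box, dst: Box):
    """Per-axis scale and center translation (pixels) mapping src onto dst."""
    sx, sy = dst.w / src.w, dst.h / src.h
    (cx, cy), (tx, ty) = src.center, dst.center
    return {"scale": (sx, sy), "translate": (tx - cx, ty - cy)}

# e.g. move an object right and up while enlarging it by 1.5x
print(box_transform(Box(100, 200, 80, 60), Box(300, 180, 120, 90)))
# -> {'scale': (1.5, 1.5), 'translate': (220.0, -5.0)}
```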