Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Penghui Ruan, Bojia Zi, Xianbiao Qi, Youze Huang, Rong Xiao, Pichao Wang, Jiannong Cao, Yuhui Shi

TL;DR

Ctrl&Shift tackles geometry-consistent object manipulation in images and videos without explicit 3D modeling at inference. It injects relative camera pose control into a diffusion process and decomposes each edit into object removal followed by reference-guided inpainting. A multi-task, multi-stage training regime disentangles background, identity, and pose signals, supported by a scalable data pipeline that creates real-world paired samples with estimated relative poses. A relative-pose descriptor $f \in \mathbb{R}^{8}$ guides viewpoint changes, and the diffusion model is trained with a flow-matching objective. Empirically, it reports state-of-the-art fidelity, viewpoint consistency, and controllability on ObjectMover-A and GeoEditBench, with strong pose accuracy and background preservation. By avoiding 3D modeling at inference while retaining precise geometric control, Ctrl&Shift combines the rigor of geometry-based methods with the flexibility of diffusion models, enabling scalable, geometry-aware editing for professional visual content pipelines.
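
For readers unfamiliar with flow matching, one common (rectified-flow) form of the objective is sketched below. This summary does not reproduce the paper's equations, so the notation here is an assumption: $\mathbf{x}_0$ is the clean latent, $\boldsymbol{\epsilon}$ is Gaussian noise, $t$ is the interpolation time, $\mathbf{c}$ stands for the conditioning inputs (which would include the relative-pose descriptor), and $v_\theta$ is the velocity predicted by the network. The paper's exact parameterization may differ.

```latex
\begin{aligned}
\mathbf{x}_t &= (1 - t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon},
  \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\; t \sim \mathcal{U}[0, 1], \\
\mathcal{L}_{\text{FM}} &= \mathbb{E}_{\mathbf{x}_0,\, \boldsymbol{\epsilon},\, t}\,
  \bigl\| v_\theta(\mathbf{x}_t, t, \mathbf{c}) - (\boldsymbol{\epsilon} - \mathbf{x}_0) \bigr\|_2^2 .
\end{aligned}
```

In this form the regression target $\boldsymbol{\epsilon} - \mathbf{x}_0$ is simply the time derivative of the straight-line path $\mathbf{x}_t$.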

Abstract

Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.

Paper Structure (23 sections, 15 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Results of Ctrl&Shift. Ctrl&Shift demonstrates its superior capability (controllability, plausibility, and consistency) on tasks including (a) precise object manipulation, (b) visual object removal, and (c) reference image inpainting with precise camera pose control.
  • Figure 2: The proposed Ctrl&Shift framework employs a multi-task, multi-stage training paradigm integrating object manipulation, removal, and reference inpainting with explicit camera control. Stage 1 focuses on acquiring object priors and camera control; Stage 2 emphasizes background preservation through fine-tuning on high-quality data. See Appendix \ref{sec:architecture} for the detailed architecture.
  • Figure 3: Overview of the architecture.
  • Figure 4: Construction of data pairs $(\mathbf{X}^{\text{src}}, \mathbf{s}^{\text{src}})$ and $(\mathbf{X}^{\text{tgt}}, \mathbf{s}^{\text{tgt}})$. From $\mathbf{X}^{\text{src}}$, an image-to-mesh model reconstructs the object mesh, and $\mathbf{s}^{\text{src}}$ is estimated via differentiable rasterization. The target pose $\mathbf{s}^{\text{tgt}}$ is sampled, the object is rendered using the mesh, and an object pasting model generates $\mathbf{X}^{\text{tgt}}$. Our pipeline supports both image and video data synthesis, as the object pasting model is a reference-image inpainting model capable of editing both images and videos. For video inputs, the image-to-mesh reconstruction, camera pose estimation, and rendering are all performed on the first frame.
  • Figure 5: Qualitative comparisons for object manipulation, displaying relative camera changes and NDC shifts. Our model outperforms state-of-the-art methods in background preservation, precise camera pose control, and geometric consistency.
  • ...and 6 more figures
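
The Figure 4 caption pairs each sample with a source pose and a target pose, and the method conditions on the relative pose between them. As a rough illustration, the sketch below computes a relative rigid transform from two poses. It assumes each pose is a rotation matrix `R` and translation `t` mapping scene points into camera coordinates (`x_cam = R @ x + t`); that convention, the helper names, and the example values are assumptions for illustration, and the paper's actual descriptor encoding is not given in this summary.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def matvec3(a, v):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_y(deg):
    """Rotation by `deg` degrees about the y axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def relative_pose(r_src, t_src, r_tgt, t_tgt):
    """Return (R_rel, t_rel) mapping source-camera coordinates to
    target-camera coordinates: x_tgt = R_rel @ x_src + t_rel.

    Derivation under the assumed convention x_cam = R @ x + t:
      R_rel = R_tgt @ R_src^T
      t_rel = t_tgt - R_rel @ t_src
    """
    r_rel = matmul3(r_tgt, transpose3(r_src))
    rotated_t_src = matvec3(r_rel, t_src)
    t_rel = [t_tgt[i] - rotated_t_src[i] for i in range(3)]
    return r_rel, t_rel


if __name__ == "__main__":
    # Hypothetical poses: the view rotates from 10 to 40 degrees about y.
    r_rel, t_rel = relative_pose(rot_y(10), [0.0, 0.0, 2.0],
                                 rot_y(40), [0.0, 0.0, 2.0])
    print("relative rotation:", r_rel)
    print("relative translation:", t_rel)
```

Working with the relative transform rather than two absolute poses means the conditioning does not depend on the arbitrary scene frame in which the source pose was estimated.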