Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation

Penghui Ruan; Bojia Zi; Xianbiao Qi; Youze Huang; Rong Xiao; Pichao Wang; Jiannong Cao; Yuhui Shi

TL;DR

Ctrl&Shift tackles geometry-consistent object manipulation in images and videos without explicit 3D modeling at inference. It injects relative camera pose control into a diffusion process and decomposes each edit into object removal plus reference-guided inpainting. A multi-task, multi-stage training regime disentangles background, identity, and pose signals, supported by a scalable data pipeline that creates real-world paired samples with estimated relative poses. A relative-pose descriptor $f \in \mathbb{R}^{8}$ guides viewpoint changes, and the diffusion model is trained with a flow-matching objective. Empirically, it achieves state-of-the-art fidelity, viewpoint consistency, and controllability on ObjectMover-A and GeoEditBench, with strong pose accuracy and background preservation. By avoiding 3D modeling at inference while maintaining precise geometric control, Ctrl&Shift combines the rigor of geometry-based methods with the flexibility of diffusion models, enabling scalable, geometry-aware editing for professional visual content pipelines.
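The summary names a flow-matching objective without stating its form. The sketch below shows one common variant (a straight-line path from noise to data with a constant target velocity), written in NumPy. It is an illustration under assumptions, not the paper's training code: the function name `flow_matching_loss`, the `predict_velocity` callable, and the convention that t = 0 is noise and t = 1 is data are all hypothetical choices made here, and conditioning on masks, references, or pose is omitted.

```python
import numpy as np

def flow_matching_loss(predict_velocity, x_data, rng):
    """Monte Carlo estimate of a linear-path flow-matching loss.

    Assumed convention (not taken from the paper): t=0 is pure noise,
    t=1 is data, and the model regresses the constant path velocity.
    """
    noise = rng.standard_normal(x_data.shape)
    # One time value per sample, broadcast over the remaining axes.
    t_shape = (x_data.shape[0],) + (1,) * (x_data.ndim - 1)
    t = rng.uniform(size=t_shape)
    x_t = (1.0 - t) * noise + t * x_data   # point on the straight path
    target = x_data - noise                # velocity d(x_t)/dt along that path
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Toy usage with a hypothetical "model" that always predicts zero velocity.
rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 16))
loss = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), batch, rng)
```

In a real system `predict_velocity` would be the diffusion network, and the loss would be minimized over its parameters with an autodiff framework rather than evaluated in NumPy.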

Abstract

Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present Ctrl&Shift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that Ctrl&Shift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.
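To make "relative camera pose" concrete, the sketch below composes two camera poses into the rigid transform between them. It assumes world-to-camera extrinsics of the form x_cam = R x_world + t; the helper names `rot_y` and `relative_pose` are hypothetical, and how the paper encodes the resulting pose as a conditioning signal is not described in this summary.

```python
import numpy as np

def rot_y(angle):
    """Rotation matrix for a rotation of `angle` radians about the y-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def relative_pose(R_src, t_src, R_tgt, t_tgt):
    """Transform taking source-camera coordinates to target-camera coordinates.

    Assumes world-to-camera poses: x_cam = R @ x_world + t.
    """
    R_rel = R_tgt @ R_src.T
    t_rel = t_tgt - R_rel @ t_src
    return R_rel, t_rel

# Example: two views of the same scene, differing by a yaw and a small shift.
R_src, t_src = rot_y(0.3), np.array([0.1, -0.2, 1.5])
R_tgt, t_tgt = rot_y(-0.4), np.array([0.0, 0.3, 2.0])
R_rel, t_rel = relative_pose(R_src, t_src, R_tgt, t_tgt)
```

A point expressed in the source camera frame, `x_src`, then maps to the target frame as `R_rel @ x_src + t_rel`, which is what makes the pair of views describable by a single relative pose.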

Paper Structure (23 sections, 15 equations, 11 figures, 3 tables)

Introduction
Related Work
Method
Network Architecture
Mask Encoding
Camera Pose Encoding
Multi-Task Multi-Stage Training
Dataset Construction
Experiments
Limitations and Future Work
Conclusion
Appendix
LLM Usage
Training Details
Object Pasting Model
...and 8 more sections

Figures (11)

Figure 1: Results of Ctrl&Shift. Ctrl&Shift demonstrates its superior capability (controllability, plausibility, and consistency) on tasks including (a) precise object manipulation, (b) visual object removal, and (c) reference image inpainting with precise camera pose control.
Figure 2: The proposed Ctrl&Shift framework employs a multi-task, multi-stage training paradigm integrating object manipulation, removal, and reference inpainting with explicit camera control. Stage 1 focuses on acquiring object priors and camera control; Stage 2 emphasizes background preservation through fine-tuning on high-quality data. See the Appendix for the detailed architecture.
Figure 3: Overview of the architecture.
Figure 4: Construction of data pairs $(\mathbf{X}^{\text{src}}, \mathbf{s}^{\text{src}})$ and $(\mathbf{X}^{\text{tgt}}, \mathbf{s}^{\text{tgt}})$. From $\mathbf{X}^{\text{src}}$, an image-to-mesh model reconstructs the object mesh, and $\mathbf{s}^{\text{src}}$ is estimated via differentiable rasterization. The target pose $\mathbf{s}^{\text{tgt}}$ is sampled, the object is rendered using the mesh, and an object pasting model generates $\mathbf{X}^{\text{tgt}}$. Our pipeline supports both image and video data synthesis, as the object pasting model is a reference-image inpainting model capable of editing both image and video. For video inputs, the image-to-mesh reconstruction, camera pose estimation, and rendering are all performed on the first frame.
Figure 5: Qualitative comparisons for object manipulation, displaying relative camera changes and NDC shifts. Our model outperforms state-of-the-art methods in background preservation, precise camera pose control, and geometric consistency.
...and 6 more figures
