SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Zhenyuan Qin, Xincheng Shuai, Henghui Ding

TL;DR

SceneDesigner addresses the problem of controlling 9D poses for multiple objects within a single image by introducing CNOCS, a cuboid-based pose encoding that preserves geometric cues, and by training a ControlNet-like branched diffusion model with a two-stage RL finetuning objective. It also contributes ObjectPose9D, a diverse 9D pose-annotated dataset, and Disentangled Object Sampling to mitigate multi-object concept confusion during inference, with personalization weights enabling user-specific pose control. Quantitative and qualitative results show SceneDesigner achieves higher pose accuracy, spatial fidelity, and image quality than prior 3D-aware controllable-generation methods across single- and multi-object scenarios, while maintaining reasonable inference efficiency. The work advances practical 3D-aware content creation for design, AR/VR, and related applications, though it acknowledges limitations in precise shape control and potential misuse, proposing safeguards and future improvements.

Abstract

Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network into the pre-trained base model and leverages a new representation, the CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects. Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
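To make the term "9D pose" concrete, the sketch below shows one way to hold the nine values (3 for location, 3 for size, 3 for orientation) and turn them into the eight corners of an oriented cuboid. It is an illustrative assumption, not the paper's CNOCS encoding or code: the names `Pose9D`, `rotation_matrix`, and `cuboid_corners` are hypothetical, and the Euler-angle parameterization of orientation is chosen here only for simplicity.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass
class Pose9D:
    """Hypothetical container: 3 location + 3 size + 3 orientation values."""
    location: tuple  # (x, y, z) of the cuboid center in the camera frame
    size: tuple      # extents of the cuboid along its local x, y, z axes
    euler: tuple     # (roll, pitch, yaw) in radians; Euler angles are an assumption


def rotation_matrix(roll, pitch, yaw):
    """Compose R = Rz(yaw) * Ry(pitch) * Rx(roll) as nested lists."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def cuboid_corners(pose):
    """Return the 8 corners of the posed cuboid in the camera frame."""
    R = rotation_matrix(*pose.euler)
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner in the object's local frame, scaled by the cuboid size.
        local = [s * d for s, d in zip(signs, pose.size)]
        # Rotate the corner, then translate it to the object's location.
        posed = tuple(
            sum(R[i][j] * local[j] for j in range(3)) + pose.location[i]
            for i in range(3)
        )
        corners.append(posed)
    return corners


# Example: a 2 x 1 x 4 cuboid, 5 units in front of the camera, yawed by 90 degrees.
pose = Pose9D(location=(0.0, 0.0, 5.0), size=(2.0, 1.0, 4.0), euler=(0.0, 0.0, math.pi / 2))
corners = cuboid_corners(pose)
```

Changing any one of the nine values moves, resizes, or rotates the cuboid independently of the others, which is the kind of per-object control the abstract describes.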

Paper Structure

This paper contains 21 sections, 4 equations, 13 figures, 5 tables, 2 algorithms.

Figures (13)

  • Figure 1: 9D pose control results of SceneDesigner. The figures show applications in single-object, multi-object, and customization scenarios, exhibiting high quality, flexibility, and fidelity.
  • Figure 2: Overview of SceneDesigner.
  • Figure 3: Illustration of CNOCS map.
  • Figure 4: Annotation pipeline of 9D poses in MS-COCO.
  • Figure 5: Evaluation of 9D pose control in single- and multi-object scenarios. SceneDesigner outperforms other methods in both fidelity and quality under various pose conditions.
  • ...and 8 more figures