Scene-Conditional 3D Object Stylization and Composition

Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht

TL;DR

This work tackles the problem of placing and stylizing a 3D object to fit a given 2D scene while achieving photorealistic composition. It introduces a Scene-Conditional 3D Object Stylization and Composition framework that jointly optimizes a textured mesh, neural texture, and lighting through differentiable ray tracing guided by diffusion priors, with scene-aware prompts and a white diffuse sphere to capture environment lighting. Key contributions include: (i) environment-aware texture adaptation via GPT-4–augmented prompts and reference feature injection to preserve object identity, (ii) a blending strategy that leverages global-view inpainting and local-view renders for seamless integration, (iii) indoor/outdoor lighting estimation using HDR environment maps and a light-capturing apparatus, and (iv) comprehensive ablations and comparisons demonstrating robust, controllable scene-object composition across diverse scenes. The method enables realistic, reusable 3D assets for downstream tasks such as video games and media production, offering practical control over appearance and illumination conditioned on the scene.

Abstract

Recently, 3D generative models have made impressive progress, enabling the generation of almost arbitrary 3D assets from text or image inputs. However, these approaches generate objects in isolation, without any consideration for the scene where they will eventually be placed. In this paper, we propose a framework that stylizes an existing 3D asset to fit into a given 2D scene and additionally produces a photorealistic composition, as if the asset were placed within the environment. This not only opens up a new level of control for object stylization (for example, the same asset can be stylized to reflect changes in the environment, such as summer to winter or fantasy versus futuristic settings), but also makes the object-scene composition more controllable. We achieve this by modeling and optimizing the object's texture and environmental lighting through differentiable ray tracing, combined with image priors from pre-trained text-to-image diffusion models. We demonstrate that our method is applicable to a wide variety of indoor and outdoor scenes and arbitrary objects. Project page: https://jensenzhoujh.github.io/scene-cond-3d/.

Paper Structure (76 sections, 3 equations, 18 figures, 5 tables)


Figures (18)

  • Figure 1: We present a framework that adapts a 3D object's appearance to a location in a 2D scene. It creates an image where the 3D object is seamlessly blended (left & bottom right), with its appearance influenced by the scene's environmental conditions and lighting effects. Moreover, the stylized object with adapted textures (top right), rendered here without the estimated lighting condition for illustrative purposes, can be further utilized as 3D assets for downstream tasks such as video games.
  • Figure 2: Framework. We learn an environment map and a texture map separately from the 2D supervision. We initialize (init.) the environment map with an LDR map estimated from the 2D scene and learn light-multiplying scales for its bright areas, yielding an HDR map. We employ the PBR material model for texture maps, encoded via an MLP with positional encoding. The object is rendered through a differentiable ray tracer and further composed with the scene background, receiving gradients from Stable Diffusion (SD) in the latent space.
  • Figure 3: Pipeline for texture adaptation. We initialize the neural texture from the reference object and inject features of the reference renderings into the U-Net of SD. We use both local-view and global-view guidance.
  • Figure 4: Pipeline for estimating the LDR map. We utilize tailored pipelines for indoor and outdoor scenes. Areas masked in red and blue correspond to the far-light and near-light regions.
  • Figure 5: Visual Results. (a) We showcase that our method applies to a diverse range of objects and scenes. The global view (top row) demonstrates the overall composition quality, and the object-centric local views (bottom two rows) demonstrate the fidelity of the stylized textures. For dim scenes, we additionally render objects without the estimated lighting condition (w/o light) for illustrative purposes. We also showcase that our method applies to both (b) small and (c) big objects, as well as (d) different placing locations. The television, for example, adjusts its texture to match its surroundings.
  • ...and 13 more figures
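
The Figure 2 caption describes turning an estimated LDR environment map into an HDR map by scaling its bright areas. The following is a minimal sketch of that idea only, not the authors' implementation. The function name `ldr_to_hdr`, the luminance threshold, the use of Rec. 709 luma weights to decide what counts as "bright", and the single fixed `scale` are all assumptions made for illustration; in the paper the scales are learned through the differentiable renderer rather than set by hand.

```python
def ldr_to_hdr(ldr, scale, threshold=0.9):
    """Multiply bright pixels of an LDR map by `scale`; leave the rest unchanged.

    ldr: list of rows, each a list of (r, g, b) tuples with values in [0, 1].
    scale, threshold: hypothetical parameters, not values from the paper.
    """
    hdr = []
    for row in ldr:
        out_row = []
        for r, g, b in row:
            # Rec. 709 luma weights, used here as an assumed brightness measure.
            luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b
            # Only pixels at or above the threshold are treated as light sources.
            k = scale if luminance >= threshold else 1.0
            out_row.append((r * k, g * k, b * k))
        hdr.append(out_row)
    return hdr


# Toy 2x2 environment map: one saturated pixel standing in for a light source.
ldr_map = [
    [(0.2, 0.2, 0.2), (1.0, 1.0, 1.0)],
    [(0.5, 0.4, 0.3), (0.1, 0.1, 0.1)],
]
hdr_map = ldr_to_hdr(ldr_map, scale=8.0)
```

In this toy map only the saturated pixel is boosted beyond the [0, 1] range, while the dimmer pixels keep their LDR values. A learned variant would treat the scale as an optimizable parameter instead of a constant.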