MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion

Dongseok Shim, Yichun Shi, Kejie Li, H. Jin Kim, Peng Wang

TL;DR

MVLight is a novel light-conditioned multi-view diffusion model that integrates lighting conditions directly into the generation process, enabling the synthesis of 3D models with improved geometric precision and relighting capabilities.

Abstract

Recent advances in text-to-3D generation, building on the success of high-performance text-to-image generative models, have made it possible to create imaginative and richly textured 3D objects from textual descriptions. However, a key challenge remains: effectively decoupling light-independent and light-dependent components to improve both the quality of generated 3D models and their relighting performance. In this paper, we present MVLight, a novel light-conditioned multi-view diffusion model that integrates lighting conditions directly into the generation process. This enables the model to synthesize high-quality images that faithfully reflect the specified lighting environment across multiple camera views. By applying this capability to Score Distillation Sampling (SDS), we can synthesize 3D models with improved geometric precision and relighting capabilities. We validate the effectiveness of MVLight through extensive experiments and a user study.

Paper Structure

This paper contains 15 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: MVLight Overview. MVLight synthesizes 3D-consistent outputs that appear as if captured under specific lighting conditions, with the light environment provided as input across all camera views. Additionally, MVLight generates three distinct modalities (normal, albedo, and RGB) under the specified lighting conditions, enhancing both geometric accuracy and relighting capabilities. Here, $\mathbf{x}_{T}$ represents the random noise input to the diffusion model, and $t$ and $\boldsymbol{\zeta}$ denote the denoising timestep and camera poses, respectively. $L$ refers to the HDR map, with $L_{hf}$ and $L_{lf}$ indicating its high-frequency and low-frequency components, respectively.
  • Figure 2: 3D Generation with light-aware multi-view SDS. MVLight is integrated into the SDS optimization pipeline, ensuring 3D consistency while enabling the decomposition of light-dependent and light-independent components. The pipeline consists of two stages: the first focuses on synthesizing the overall geometry and appearance, while the second stage refines PBR materials to enhance relighting capabilities.
  • Figure 3: Qualitative results of PBR material decomposition and relit 3D models. Our proposed method enables better estimation of albedo, metallic, and roughness values, which leads to improved relighting capability.
  • Figure 4: User study.
  • Figure 5: Effectiveness of multi-modal SDS on the estimation of normal map and albedo value. 3D models trained with multi-modal SDS produce smoother normal maps with non-bumpy surfaces and more distinct albedo values with accurate, albedo-like color distribution, compared to those trained without multi-modal SDS.
  • ...and 5 more figures