Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting

Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu

TL;DR

GS-Light performs training-free, position-aware, multi-view relighting of 3D Gaussian Splatting scenes by fusing LVLM-derived lighting priors with geometric and semantic cues through a Position-Align Module and a cross-view diffusion framework (MV-ICLight). The approach enforces cross-view coherence via improved epipolar constraints and a multi-view attention mechanism, followed by iterative fine-tuning of the Gaussian Splatting scene so that it converges to a consistent relit result. It reports better objective and perceptual metrics than the baselines across indoor and outdoor datasets while remaining efficient at inference time (priors ~3 minutes, per-scene relighting ~3 minutes) and requiring no per-scene training of diffusion weights. The work targets controllable, text-guided relighting of complex 3D scenes and lays the groundwork for scalable, semantically faithful 3D editing.

Abstract

We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These initial latents guide the diffusion model toward relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the initial latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. On quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and in qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
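
As a rough illustration of how per-pixel geometry and a parsed light position could be fused into an illumination map, the sketch below computes a Lambertian (cosine) intensity map with NumPy. The shading model, the function name `light_intensity_map`, and the peak normalization are all assumptions made for illustration; the abstract does not specify the formula GS-Light actually uses.

```python
import numpy as np

def light_intensity_map(points, normals, light_pos):
    """Hypothetical helper: per-pixel Lambertian intensity in [0, 1].

    points:    (H, W, 3) per-pixel 3D positions (e.g. unprojected depth)
    normals:   (H, W, 3) unit surface normals
    light_pos: (3,) assumed point-light position in the same frame
    """
    # Unit direction from each surface point toward the light.
    to_light = np.asarray(light_pos, dtype=float) - points
    dist = np.linalg.norm(to_light, axis=-1, keepdims=True)
    to_light = to_light / np.clip(dist, 1e-8, None)

    # Cosine between normal and light direction; back-facing pixels get 0.
    cos = np.sum(normals * to_light, axis=-1)
    intensity = np.clip(cos, 0.0, 1.0)

    # Normalize so the brightest pixel is 1 (an assumption, for display).
    peak = intensity.max()
    return intensity / peak if peak > 0 else intensity


# Toy usage: a flat 3x3 patch at z = 0 facing +z, light 1 unit above its center.
xs, ys = np.meshgrid(np.linspace(-1, 1, 3), np.linspace(-1, 1, 3))
points = np.stack([xs, ys, np.zeros_like(xs)], axis=-1)
normals = np.zeros_like(points)
normals[..., 2] = 1.0
demo_map = light_intensity_map(points, normals, np.array([0.0, 0.0, 1.0]))
```

In this toy case the center pixel is brightest and intensity falls off toward the corners; moving the assumed light position shifts the bright region, which is the kind of directional cue an initial latent could carry.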

Paper Structure

This paper contains 29 sections, 15 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Qualitative comparison between our GS-Light and prior work on relighting videos or Gaussian Splatting scenes. Our GS-Light produces results with higher fidelity and aesthetic quality, improved multi-view consistency, stronger semantic relevance, and enhanced controllability of lighting direction through text prompts.
  • Figure 2: Closed-loop pipeline of GS-Light. Starting from a pre-trained Gaussian Splatting (GS) scene and a text prompt specifying the relighting instruction, we first render images from all training views. One training view is selected as the reference view to align the positional information in the prompt. Through our proposed Position-Align Module (PAM), we generate position-aligned light intensity maps for all views. These intensity maps are then provided as initialization latents to our Multi-View ICLight, producing multi-view consistent relit images. Finally, the relit images are used to fine-tune the opacity and color parameters of the GS scene, closing the tuning loop. Repeating this cycle multiple times ensures that the relit GS converges to a stable and consistent result.
  • Figure 3: IC-Light relighting results for different light-direction instructions, showing a weak or incorrect response to position-related information.
  • Figure 4: Details of PAM. Given the rendered views and a text prompt, Qwen2.5-VL is employed with a preset VQA template to parse the user’s intended lighting direction and reference object. Pretrained models VGGT, StableNormal, and Lang-SAM are then applied to estimate the initial light position and scene geometry. By combining these estimates with the parsed light-position offset, PAM produces light-intensity maps that are spatially aligned with the input positional intent across all views.
  • Figure 5: The process of estimating light source position.
  • ...and 4 more figures
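
The closed-loop procedure described in the Figure 2 caption (render, position-align, relight, fine-tune, repeat) can be sketched as a small driver loop. This is only a structural sketch: `relight_loop` and its four callable parameters are hypothetical stand-ins for the renderer, PAM, Multi-View ICLight, and the GS fine-tuning step, and recomputing the intensity maps in every round is an assumption that the caption does not confirm.

```python
from typing import Any, Callable, List

def relight_loop(
    scene: Any,
    prompt: str,
    render_views: Callable[[Any], List[Any]],
    position_align: Callable[[List[Any], str], List[Any]],
    relight_views: Callable[[List[Any], List[Any], str], List[Any]],
    finetune: Callable[[Any, List[Any]], Any],
    num_rounds: int = 3,
) -> Any:
    """Hypothetical driver for the render -> PAM -> relight -> fine-tune cycle."""
    for _ in range(num_rounds):
        views = render_views(scene)                         # render all training views
        init_latents = position_align(views, prompt)        # stand-in for PAM intensity maps
        relit = relight_views(views, init_latents, prompt)  # stand-in for Multi-View ICLight
        scene = finetune(scene, relit)                      # update GS appearance parameters
    return scene


# Toy usage: the "scene" is a single brightness value pulled toward a target of 1.0.
toy_result = relight_loop(
    scene=0.0,
    prompt="light from the left",
    render_views=lambda s: [s, s],
    position_align=lambda views, p: [1.0 for _ in views],
    relight_views=lambda views, latents, p: list(latents),
    finetune=lambda s, relit: s + 0.5 * (sum(relit) / len(relit) - s),
    num_rounds=3,
)
```

Passing the stages in as callables keeps the loop independent of any particular renderer or diffusion model; the toy stages only show that repeated rounds move the scene state steadily toward the relit target.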