Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap

Elisabeth Jüttner, Leona Krath, Stefan Korfhage, Hannah Dröge, Matthias B. Hullin, Markus Plack

TL;DR

The paper tackles production-ready volumetric video relighting by marrying diffusion-derived material priors with temporal regularization and physically based rendering. It leverages Gaussian Opacity Fields for novel-view synthesis and a proxy mesh to render indirect effects, while estimating roughness and metallic maps via a diffusion decomposition with flow-guided smoothing. The approach yields temporally stable, high-fidelity relighting across real and synthetic data, outperforming diffusion-only baselines and GOF, and remains scalable beyond clip lengths feasible for video diffusion. By balancing learned priors with physically grounded constraints, the method provides a practical bridge toward production pipelines for volumetric capture relighting.

Abstract

Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.
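
The aggregation and flow-guided smoothing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the regularization's exact form, so this sketch swaps in a per-pixel median over the stochastic samples and a simple recursive blend with the flow-warped previous frame. The function names, the `weight` parameter, and the flow convention (each flow vector points from a pixel in frame t to its position in frame t-1) are all assumptions made for illustration.

```python
import numpy as np

def aggregate_samples(samples):
    """Per-pixel median over K stochastic estimates; samples has shape (K, H, W)."""
    return np.median(samples, axis=0)

def warp_backward(prev, flow):
    """Warp prev (H, W) into the current frame with nearest-neighbor lookup.

    Assumed convention: flow[y, x] = (dx, dy) points from the current pixel
    to its location in the previous frame. Out-of-range lookups are clamped.
    """
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev[src_y, src_x]

def temporally_smooth(frames, flows, weight=0.5):
    """Blend each frame with the flow-warped previous smoothed frame.

    frames: (T, H, W) aggregated maps (e.g. roughness or metallic).
    flows:  (T-1, H, W, 2), where flows[t-1] maps frame t back to frame t-1.
    weight: hypothetical blend factor; higher values favor temporal stability.
    """
    out = [frames[0]]
    for t in range(1, len(frames)):
        warped = warp_backward(out[-1], flows[t - 1])
        out.append((1.0 - weight) * frames[t] + weight * warped)
    return np.stack(out)
```

In this sketch the median suppresses outliers among the diffusion samples of a single frame, while the recursive blend trades per-frame detail for stability over time. A variational formulation would instead optimize all frames jointly.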

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: We integrate diffusion-based decomposition priors with variational methods and image-based lighting to achieve physically plausible, temporally stable relighting of captured volumetric content and seamless compositing into diverse virtual environments.
  • Figure 2: We optimize a Gaussian Opacity Field [yu2024gaussian] from multi-view captures to render RGB, depth, and normal maps for the novel views and to extract a proxy mesh (left). Using a diffusion decomposition model [liang2025diffusion], we extract roughness and metallic maps, which we smooth using an optical-flow-guided temporal regularization (top). We render the proxy geometry as a shadow caster in the 3D scene (bottom) and blend it with our screen-space rendered image (right).
  • Figure 3: Qualitative comparison with recent state-of-the-art relighting methods. Each row corresponds to a different scene under novel lighting conditions. Top panels show the full relit view, while bottom panels provide close-up crops.
  • Figure 4: Comparison of relighting quality between GOF meshes [yu2024gaussian] (left) and our method (right). GOF often approximates natural regions such as the arm with piecewise planar surfaces, whereas our approach produces smoother reconstructions and cleaner alpha mattes in challenging areas like hair.
  • Figure 5: Qualitative results on the synthetic dataset, showing ground truth (top) and our method (bottom).
  • ...and 1 more figures