Denoising Monte Carlo Renders with Diffusion Models

Vaibhav Vavilala, Rahul Vasanth, David Forsyth

TL;DR

This work tackles Monte Carlo render noise, which is heavy-tailed at low sample counts, by using a pixel-space diffusion model conditioned on render buffers to denoise low-spp images. The method pairs a pretrained DeepFloyd Stage II backbone with a trainable Control Module that fuses normals, albedo, depth, and other buffers, and it is trained to reverse the forward diffusion process with standard diffusion losses. Across multiple sampling rates, the approach is quantitatively competitive with SOTA and yields qualitatively more realistic images, particularly in edges, shadows, and highlights, because of the strong image prior of the diffusion model. In practice, the method shows that large-scale image foundation models can be repurposed for MC denoising with visible qualitative gains, albeit at higher inference cost than one-pass denoisers, and with potential for future speedups and video extensions.

Abstract

Physically-based renderings contain Monte Carlo noise, with variance that increases as the number of rays per pixel decreases. This noise, while zero-mean for good modern renderers, can have heavy tails (most notably, for scenes containing specular or refractive objects). Learned methods for restoring low-fidelity renders are highly developed, because suppressing render noise means one can save compute and use fast renders with few rays per pixel. We demonstrate that a diffusion model can denoise low-fidelity renders successfully. Furthermore, our method can be conditioned on a variety of natural render information, and this conditioning helps performance. Quantitative experiments show that our method is competitive with SOTA across a range of sampling rates. Qualitative examination of the reconstructions suggests that the image prior applied by a diffusion method strongly favors reconstructions that look like real images, with straight shadow boundaries, curved specularities, and no fireflies.
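
The relationship between ray count and noise described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's renderer: `render_pixel`, the 1% highlight probability, and the radiance values are hypothetical choices made only to produce an unbiased, heavy-tailed per-pixel estimator whose variance shrinks as samples per pixel (spp) grow.

```python
import random
import statistics

def render_pixel(spp, rng):
    """Average `spp` radiance samples for one pixel (toy model, not a real renderer)."""
    total = 0.0
    for _ in range(spp):
        # Hypothetical integrand: most paths return dim radiance in [0, 1),
        # while rare paths hit a specular highlight and return a very bright
        # value. The rare bright samples create the heavy tail.
        if rng.random() < 0.01:
            total += 50.0
        else:
            total += rng.uniform(0.0, 1.0)
    return total / spp

def estimator_variance(spp, trials=2000, seed=0):
    """Empirical variance of the pixel estimate over many independent renders."""
    rng = random.Random(seed)
    estimates = [render_pixel(spp, rng) for _ in range(trials)]
    return statistics.pvariance(estimates)

if __name__ == "__main__":
    # Variance should fall roughly in proportion to 1 / spp.
    for spp in (4, 16, 64):
        print(spp, estimator_variance(spp))
```

In this toy model a single bright sample at low spp dominates the pixel average, which is the mechanism behind fireflies; averaging more samples dilutes it.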

Paper Structure

This paper contains 11 sections, 7 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Overview of our method. We leverage a pretrained pixel-space diffusion model, DeepFloyd Stage II [DeepFloydIF], as our base synthesizer, which is kept fixed during training. It accepts the noisy radiance as well as a forward-diffused copy of the noisy radiance (see the methodology section for details). We introduce a trainable Control Module, analogous to ControlNet [zhang2023adding], initialized with the encoder and middle blocks. It accepts all the auxiliary feature buffers from the renderer, such as albedo, normals, and depth, in addition to the noisy radiance. The outputs of the Control Module are added to the DeepFloyd (DF) decoder blocks at varying spatial resolutions. We utilize zero convolutions to ease the early stages of training. Time and prompt encodings are not shown for brevity. Our prompt is the empty string during training and inference.
  • Figure 2: Example images from our procedurally-generated dataset.
  • Figure 3: Best viewed online in color. Diffusion methods based on a latent-variable image representation will not work for render denoising, because the VAE decoding creates significant problems. Existing ControlNet [zhang2023adding] architectures use a latent representation of the image. The limited size of the VAE dictionary limits the accuracy of very precise pixel-space tasks. Upsampling the image helps, but does not remove this effect. In each row, 64x64 cropped training images are shown, highlighting texture. We take the GT training image, upsample by $k$, feed it through a VAE, decode, downsample by $k$, and show the result. We use the default VAE from Stable Diffusion 2.1. Error metrics shown underneath are averaged over 64 random training images. The VAE introduces unacceptable changes to texture. In the first row, the black patches are sharply defined in GT, but blurred after VAE decoding. In the second, the VAE shifts the color of the texture. These changes are partially due to the 8x spatial downsampling factor of VAEs, which converts an (H, W, 3) image into an (H/8, W/8, 4) latent code. Thus, existing control mechanisms that rely on latent-space diffusion models, like ControlNet, are not suitable for pixel-space tasks like MC denoising. Even with 4x upscaling, which defeats much of the efficiency gain of the VAE, the PSNR of the VAE round trip is only comparable with that of existing SOTA denoisers, which effectively caps the quality that can be achieved. Thus, in this work we introduce spatial controls to pixel-space diffusion models.
  • Figure 4: Our method requires approximately 2.8 seconds to denoise a 256x256 image and 63 seconds to denoise an HD 1080x1920 frame (without skipping time steps) on an A40 GPU. We use mixed precision during inference and the super27 DDPM sampler in DeepFloyd. This experiment suggests that we can skip around 18 of the 27 denoising time steps with negligible loss in quality, making diffusion models even more practical. At inference time, we can add Gaussian noise to the low-spp render via the forward-noising equation (eq:fwdNoise) instead of starting from pure noise. The Gaussian noise strength needs to be sufficient to overcome the variance in the noisy render. Even though diffusion is more expensive than single-pass methods, the cost of denoising remains much smaller than the cost of rendering real-world scenes to convergence, which can be dozens of hours. Even scenes that fit into GPU memory can take dozens of minutes to render to convergence, which dwarfs the cost of denoising low-ray estimates.
  • Figure 5: Qualitatively, our reconstructions look like real images, because DeepFloyd has a very strong notion of what an image looks like (it has seen a huge dataset) and because the conditioning buffers offer strong guidelines (e.g. normals, albedo, and depth). In the final column, red arrows point to areas of interest. First row. Notice the straight edge on the shadow (ours; real images tend to have straight shadow edges) compared with blurred or blotchy edges (others). Second row. In undersampled regions, our method fills in the shadow underneath a railing (real images do not have incomplete shadows); other methods render a blurred or incomplete shadow. Third row. Notice the clean sharp highlight on the teacup handle, and smooth highlight boundaries (ours; real images have clean sharp highlight boundaries) compared with absent highlights and blotchy boundaries (others). Notice also aliasing effects in the background, most prominent for OIDN, and absent from our reconstruction. Fourth row. All methods have problems with this specular dragon. Competing methods overblur the dragon's mouth, whereas ours hallucinates plausible details. Fifth row. Fireflies, also known as spike noise, are single very bright pixels, which do not occur in real images. It is rare that fireflies appear at 64 spp in our training set, so AFGSA and Isik fail to remove them. OIDN succeeds in removing them, likely because we use their pretrained model trained on a large dataset.
  • ...and 13 more figures
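
The warm start described in the Figure 4 caption (adding Gaussian noise to the low-spp render and then running only the remaining denoising steps) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard DDPM forward process x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, a hypothetical linear beta schedule over 1000 training steps, evenly spaced sampler timesteps (the actual super27 spacing may differ), and pixel values scaled to roughly [-1, 1]. The function names are illustrative, and the denoising network itself is omitted.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule; the paper's schedule may differ."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t), one value per training timestep."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_noise(x0, t, alpha_bars, rng):
    """Standard DDPM forward process applied to a flat list of pixel values."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in x0]

def warm_start(noisy_render, sampler_steps=27, skipped=18,
               train_steps=1000, seed=0):
    """Noise the low-spp render to an intermediate timestep.

    Returns the noised pixels and the timesteps that still have to be denoised.
    """
    rng = random.Random(seed)
    abars = alpha_bar(linear_beta_schedule(train_steps))
    # Assumption: sampler timesteps are evenly spaced, most noisy first.
    stride = (train_steps - 1) / (sampler_steps - 1)
    timesteps = [round((train_steps - 1) - i * stride)
                 for i in range(sampler_steps)]
    remaining = timesteps[skipped:]  # skip the highest-noise steps
    x_t = forward_noise(noisy_render, remaining[0], abars, rng)
    return x_t, remaining

if __name__ == "__main__":
    render = [0.1, -0.4, 0.9, 0.0]  # toy "image" of four pixel values
    x_t, remaining = warm_start(render)
    print(len(remaining), "steps left, starting at t =", remaining[0])
```

The starting timestep controls the trade-off the caption mentions: a later start (more skipped steps) saves compute, but the added Gaussian noise must still be strong enough to dominate the variance of the low-spp render.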