Robust Average Networks for Monte Carlo Denoising

Javor Kalojanov, Kimball Thurston

TL;DR

The paper addresses the challenge of denoising Monte Carlo renders in production by introducing Robust Average blocks that convert spatial kernel-predictive networks into bidirectional spatio-temporal denoisers. These blocks perform learned, robust temporal interpolation over a fixed window, use motion-compensated warping, and are trained with a spatial-to-temporal loss formulation to encourage temporal information usage without ground-truth sequences, complemented by thresholded kernel predictions to suppress outliers. Key contributions include RA blocks inserted at multiple depths, a temporal loss reformulation, and empirical evidence showing improved temporal coherence and edge preservation with competitive perceptual metrics, albeit with increased model complexity and inference time. The work enables production-friendly, temporally stable denoising for complex VFX scenes and lays groundwork for further improvements in temporal denoising efficiency and robustness.

Abstract

We present a method for converting denoising neural networks from spatial into spatio-temporal ones by modifying the network architecture and loss function. We insert Robust Average blocks at arbitrary depths in the network graph. Each block performs latent space interpolation with trainable weights and works on the sequence of image representations from the preceding spatial components of the network. The temporal connections are kept live during training by forcing the network to predict a denoised frame from subsets of the input sequence. Using temporal coherence for denoising improves image quality and reduces temporal flickering independent of scene or image complexity.
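The idea of keeping temporal connections live by predicting the denoised frame from subsets of the input sequence can be illustrated with a small sketch. It follows the Figure 4 caption: one term on the full sequence plus terms that predict the center frame from one previous and one subsequent frame, all against the same reference. Everything else is an assumption made for illustration and is not taken from the paper: the names `denoise`, `l1_loss`, and `spatio_temporal_loss`, the choice of symmetric pairs around the center, the L1 distance, and the equal weighting of the terms.

```python
import numpy as np

def l1_loss(pred, ref):
    # Placeholder image distance (assumption; the paper's loss may differ).
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(ref))))

def spatio_temporal_loss(denoise, frames, ref_center, pairs=None, loss=l1_loss):
    """Sum a full-sequence loss term and temporal terms on frame subsets.

    denoise    -- hypothetical callable: list of frames -> predicted center frame
    frames     -- odd-length sequence of noisy frames, center at len(frames) // 2
    ref_center -- reference image for the center frame, shared by every term
    pairs      -- optional (previous_index, subsequent_index) pairs
    """
    n = len(frames)
    if n < 3 or n % 2 == 0:
        raise ValueError("expected an odd-length sequence of at least 3 frames")
    c = n // 2
    if pairs is None:
        # Assumption: symmetric pairs around the center, e.g. (1, 3) and (0, 4)
        # for a window of 5 frames.
        pairs = [(c - k, c + k) for k in range(1, c + 1)]

    # Term on the full input sequence.
    total = loss(denoise(list(frames)), ref_center)

    # Temporal terms: predict the center frame without seeing it.
    for prev_idx, next_idx in pairs:
        if not (0 <= prev_idx < c < next_idx < n):
            raise ValueError("each pair needs one previous and one subsequent frame")
        total += loss(denoise([frames[prev_idx], frames[next_idx]]), ref_center)
    return total
```

With a window of five frames the default yields two temporal terms. Because those terms never see the center frame, a network that ignored its temporal inputs could not minimize them.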

Paper Structure (9 sections, 6 equations, 10 figures, 2 tables)

This paper contains 9 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: A sketch of the recurrent ResNet network model used for evaluation. The network consists of 24 residual blocks, with RRA blocks added after blocks 3, 6, 9, 12, 15, and 24, and skip connections after the first 4 RA blocks. The internal dimension of all convolutional layers is 80, except for the kernel-predictive layers, whose dimension is $5\times5\times5=125$.
  • Figure 2: Robust Average block for a sequence of length 5. We exclude the first frame, average the remaining frames, and interpolate between the average and the excluded frame. This is repeated (recurrently) until each frame of the sequence has been interpolated with the robust average of the other frames.
  • Figure 3: Increasing the value of the kernel threshold $t$ adjusts the influence of the denoiser on the image. The figure shows denoised images and scaled difference to the noisy input image. © 20th Century Studios / Walt Disney Studios Motion Pictures
  • Figure 4: Spatial to temporal loss conversion for a kernel-predictive neural network. Here, we add two temporal loss terms that force the network to predict the center frame from pairs of one previous and one subsequent frame. We use the same reference frame $ref_0$ in each term.
  • Figure 5: Our denoiser delivers similar image quality with slightly better image detail compared to a baseline (tKPCN) network without RA blocks. A spatial UNet keeps detail by preserving noise and sacrificing temporal stability. © 20th Century Studios / Walt Disney Studios Motion Pictures
  • ...and 5 more figures
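
The Robust Average block described in the Figure 2 caption can be sketched as a leave-one-out average followed by an interpolation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `robust_average_block` is hypothetical, the trainable interpolation weight is reduced to a fixed `alpha`, every frame is interpolated against the original (not already updated) neighbors, and motion-compensated warping is omitted.

```python
import numpy as np

def robust_average_block(frames, alpha):
    """Interpolate each frame with the average of the other frames.

    frames -- array of shape (T, C, H, W): latent representations of T frames
    alpha  -- interpolation weight in [0, 1], scalar or broadcastable to
              (C, H, W); stands in for the trainable weights (assumption)
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    if num_frames < 2:
        raise ValueError("need at least two frames to average the others")

    total = frames.sum(axis=0)
    out = np.empty_like(frames)
    for i in range(num_frames):
        # Average of the remaining frames, with frame i excluded.
        others_mean = (total - frames[i]) / (num_frames - 1)
        # Interpolate between the excluded frame and that average.
        out[i] = alpha * frames[i] + (1.0 - alpha) * others_mean
    return out
```

With `alpha = 1` the block passes the input through unchanged, and with `alpha = 0` each frame is replaced by the mean of its neighbors.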