Advances in Neural Rendering

Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, Tomas Simon, Christian Theobalt, Matthias Niessner, Jonathan T. Barron, Gordon Wetzstein, Michael Zollhoefer, Vladislav Golyanik

TL;DR

Neural rendering aims to synthesize photo-realistic imagery by learning 3D scene representations that integrate with differentiable image formation, yielding 3D-consistent novel-view synthesis. The field centers on the neural radiance field (NeRF) paradigm and volumetric rendering, in which coordinate-based MLPs represent density and radiance and are trained from 2D observations via differentiable rendering. This state-of-the-art report (STAR) surveys a broad landscape of scene representations (surfaces, volumes, implicit/explicit), rendering strategies (ray casting, rasterization), and optimization practices, then distills advances in static and dynamic view synthesis, generalization, editing, relighting, light fields, and engineering frameworks. It highlights contributions such as speedups (PlenOctrees, Instant-NGP), generalization via local/global conditioning and latent codes, and controllable dynamic NeRFs, while acknowledging open challenges in scalability, interpretability, and the societal impact of photorealistic synthetic media. Overall, neural rendering is poised to transform content creation and visualization, offering strong 3D control from 2D data, but it also demands careful attention to ethics, robustness, and computational cost.

Abstract

Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...

Paper Structure

This paper contains 33 sections, 11 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: The term "Neural Rendering" is often applied to two distinct concepts. The previous STAR on neural rendering tewari2020neuralrendering primarily focused on the paradigm shown in (a), in which a neural network is trained to map from some 2D input signal (such as a semantic label or a rasterized proxy geometry) directly to the output image; that is, the neural network is trained to render. This report focuses on a newer, emerging paradigm for neural rendering, shown in (b) and well exemplified by NeRF Mildenhall_2020_NeRF. Here, a neural network is supervised to represent the shape or appearance of a particular scene, and that neural representation is rendered using a somewhat conventional graphics "engine" that is defined analytically instead of being learned. Unlike the previous paradigm, here the neural network does not learn how to render; it instead learns to represent a scene in 3D, and that scene is then rendered according to the physics of image formation. Image adapted from meshry2019neural.
  • Figure 2: An overview of classical surface and volume representations. Images adapted from greger1998irradiance, voxelization, yariv2021volume, sdf_figure, Vladsinger2009, ChumpusRex2006.
  • Figure 3: For explicit surface representations, the surface is directly indexable. This allows us to use forward rendering methods that project the surface onto the image plane and set each pixel accordingly (e.g., using rasterization or point splatting). Implicit surface representations and volumetric representations do not provide direct information about the surface that would allow for forward rendering. Instead, the 3D space seen from the virtual camera has to be sampled to generate an image (e.g., using ray marching).
  • Figure 4: An overview of the neural radiance field (NeRF) scene representation and volume rendering procedure. NeRF synthesizes images by sampling 5D coordinates (location and viewing direction) along camera rays (a), feeding those locations into an MLP to produce color and volume density (b), and using volume rendering to composite these values into an image (c). Since this rendering function is differentiable, the NeRF scene representation MLP can be optimized by minimizing the residual between synthesized and ground truth observed images (d). Digital zoom recommended. Image adapted from Mildenhall_2020_NeRF.
  • Figure 5: Instead of sampling points $\mathbf{x}$ along the rays traced from the camera projection center (a), MipNeRF barron2021mipnerf reasons about a 3D conical frustum per camera pixel (b). Image adapted from barron2021mipnerf © 2021 IEEE.
  • ...and 8 more figures
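
The compositing step described in the Figure 4 caption (c) can be illustrated with a short sketch. It assumes the standard discrete volume rendering quadrature used by NeRF, in which each sample along a ray contributes its color weighted by its opacity and by the transmittance accumulated in front of it. The function name `composite_ray` and its arguments are hypothetical and chosen only for illustration. In an actual NeRF the densities and colors would be predicted by the MLP, whereas here they are passed in as plain numbers.

```python
import math


def composite_ray(densities, colors, deltas):
    """Composite samples along one camera ray, ordered front to back.

    densities: volume density sigma_i at each sample (non-negative floats)
    colors:    RGB triple c_i at each sample
    deltas:    distance delta_i between consecutive samples
    Returns the pixel color and the per-sample compositing weights.
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        # Opacity of this ray segment: alpha_i = 1 - exp(-sigma_i * delta_i)
        alpha = 1.0 - math.exp(-sigma * delta)
        # Contribution weight: w_i = T_i * alpha_i
        weight = transmittance * alpha
        weights.append(weight)
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        # Light that survives this segment reaches the next sample
        transmittance *= 1.0 - alpha
    return pixel, weights


if __name__ == "__main__":
    # Hypothetical ray: empty space first, then a very dense red sample.
    pixel, weights = composite_ray(
        densities=[0.0, 1e9],
        colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
        deltas=[0.1, 0.1],
    )
    print(pixel, weights)
```

Every operation in the loop is differentiable with respect to the densities and colors, which is what allows the scene representation to be optimized from the residual between rendered and observed pixels, as in step (d) of the caption.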