
Im4D: High-Fidelity and Real-Time Novel View Synthesis for Dynamic Scenes

Haotong Lin, Sida Peng, Zhen Xu, Tao Xie, Xingyi He, Hujun Bao, Xiaowei Zhou

TL;DR

Im4D tackles dynamic view synthesis by fusing a global grid-based 4D geometry with a multi-view image-based appearance model, enabling high-fidelity rendering at real-time speeds. The geometry is stored as six orthogonal feature planes that form a continuous 4D density, while appearance is inferred from nearby input views via a CNN feature extractor and a small, order-invariant network. This hybrid design yields state-of-the-art rendering quality and efficient training across multiple dynamic-scene benchmarks, achieving 79.8 FPS for 512×512 images on a single RTX 3090. The method also introduces training and rendering accelerations, including a binary-field occupancy surrogate and a targeted training schedule, contributing to robustness and practicality for real-time applications. Limitations include handling occlusions with monocular inputs, suggesting directions for future occlusion-aware view synthesis and view-selection strategies.
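To make the six-plane geometry concrete, here is a minimal NumPy sketch of a density query at a single point (x, y, z, t). It is an illustration under stated assumptions, not the authors' implementation: coordinates are assumed to be normalized to [0, 1], the six plane features are concatenated (other plane-based methods multiply them instead), the MLP has one hidden layer, and a softplus keeps the density non-negative. All function and parameter names are hypothetical.

```python
import itertools
import numpy as np

# The 6 unordered axis pairs of (x, y, z, t): xy, xz, xt, yz, yt, zt.
PLANE_PAIRS = list(itertools.combinations(range(4), 2))


def bilinear(plane, u, v):
    """Bilinearly interpolate a (R, R, C) feature plane at u, v in [0, 1]."""
    res = plane.shape[0]
    x = np.clip(u, 0.0, 1.0) * (res - 1)
    y = np.clip(v, 0.0, 1.0) * (res - 1)
    # Clamp the lower corner so that x0 + 1 and y0 + 1 stay inside the grid.
    x0 = min(int(np.floor(x)), res - 2)
    y0 = min(int(np.floor(y)), res - 2)
    fx, fy = x - x0, y - y0
    lo = (1 - fx) * plane[x0, y0] + fx * plane[x0 + 1, y0]
    hi = (1 - fx) * plane[x0, y0 + 1] + fx * plane[x0 + 1, y0 + 1]
    return (1 - fy) * lo + fy * hi


def density(point_xyzt, planes, w1, b1, w2, b2):
    """Query a 4D density from six feature planes and a small MLP.

    planes: (6, R, R, C), one plane per entry of PLANE_PAIRS.
    w1: (H, 6 * C), b1: (H,), w2: (H,), b2: scalar.
    """
    feats = [
        bilinear(planes[k], point_xyzt[i], point_xyzt[j])
        for k, (i, j) in enumerate(PLANE_PAIRS)
    ]
    feat = np.concatenate(feats)             # (6 * C,), assumed fusion rule
    hidden = np.maximum(w1 @ feat + b1, 0)   # ReLU hidden layer
    raw = w2 @ hidden + b2
    return np.logaddexp(0.0, raw)            # softplus (assumed activation)
```

Because every query touches only six small 2D lookups and a tiny MLP, the cost per sample stays low, which is consistent with the real-time goal stated above.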

Abstract

This paper aims to tackle the challenge of dynamic view synthesis from multi-view videos. The key observation is that while previous grid-based methods offer consistent rendering, they fall short in capturing the appearance details of a complex dynamic scene, whereas multi-view image-based rendering methods show the opposite properties. To combine the best of both worlds, we introduce Im4D, a hybrid scene representation that consists of a grid-based geometry representation and a multi-view image-based appearance representation. Specifically, the dynamic geometry is encoded as a 4D density function composed of spatiotemporal feature planes and a small MLP network, which globally models the scene structure and facilitates rendering consistency. We represent the scene appearance by the original multi-view videos and a network that learns to predict the color of a 3D point from image features, rather than memorizing detailed appearance entirely within the networks, which makes the networks easier to train. Our method is evaluated on five dynamic view synthesis datasets: DyNeRF, ZJU-MoCap, NHR, DNA-Rendering, and ENeRF-Outdoor. The results show that Im4D achieves state-of-the-art rendering quality and can be trained efficiently, while rendering in real time at 79.8 FPS for 512×512 images on a single RTX 3090 GPU.
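The image-based appearance branch can be sketched in a few lines. The example below is a hedged illustration rather than the paper's network: it assumes that "nearby" views are chosen by distance between camera centers, that each selected view contributes a feature vector and a sampled color for the 3D point, and that the final color is a softmax-weighted blend of those colors. The blend is order-invariant because each view's score depends only on its own feature and on the mean over all views. The names `select_nearest_views` and `blend_color`, and the single linear scoring layer, are hypothetical.

```python
import numpy as np


def select_nearest_views(target_center, source_centers, k):
    """Return indices of the k source cameras closest to the target camera.

    Assumption: closeness is Euclidean distance between camera centers.
    """
    dists = np.linalg.norm(source_centers - target_center, axis=1)
    return np.argsort(dists)[:k]


def blend_color(view_feats, view_colors, w, b):
    """Blend per-view colors with permutation-invariant softmax weights.

    view_feats:  (N, C) image features of the point projected into N views.
    view_colors: (N, 3) colors sampled at the same projections.
    w: (2 * C,), b: scalar, parameters of a linear scoring layer that
    stands in for the small network.
    """
    # Pooled context is a mean over views, so it ignores view order.
    mean = view_feats.mean(axis=0, keepdims=True)
    ctx = np.concatenate(
        [view_feats, np.broadcast_to(mean, view_feats.shape)], axis=1
    )                                           # (N, 2 * C)
    scores = ctx @ w + b                        # one score per view
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ view_colors                # (3,) blended color
```

Since the output is a convex combination of observed pixel colors, the network only has to learn how to weight views rather than memorize appearance, which matches the motivation given in the abstract.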

Paper Structure (27 sections, 6 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Overview of Im4D. Given a set of multi-view videos, the proposed method aims to reconstruct a 3D model capable of rendering photorealistic images at arbitrary viewpoints and time steps. The proposed method models the geometry with a global 4D density function. This function consists of a small MLP and a 4D space structure storing optimizable features. The appearance part is represented with a multi-view image-based appearance model, which learns to predict the color of a 3D point from image features extracted from selected images (the input images closest to the rendering view).
  • Figure 2: Qualitative comparison of image synthesis results on the DNA-Rendering dataset. The superscript * indicates that the results are obtained with extensive per-scene fine-tuning. IBRNet and ENeRF often produce artifacts in thin structures or occluded regions. Our method produces high-fidelity rendering and superior results in these regions, owing to the global geometry representation. K-Planes struggles to recover the appearance details.
  • Figure 3: Qualitative comparison on the DyNeRF dataset.
  • Figure 4: Qualitative ablation study on the NHR dataset. "app." and "geo." denote appearance and geometry, respectively.
  • Figure 5: Evaluation details on the DyNeRF dataset. The left image is the first frame (test frame) of a one-second video clip, and the right image is the average frame of this second. We identify the 6 patches with the largest differences between the test frame and the average frame. During the quantitative evaluation, we only assess these patches.
  • ...and 2 more figures
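
The patch selection described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation script: the patch size, the non-overlapping grid, and the mean absolute difference used as the score are all assumptions, and `top_diff_patches` is a hypothetical name.

```python
import numpy as np


def top_diff_patches(test_frame, clip_frames, patch=64, top_k=6):
    """Find the patches where the test frame differs most from the clip average.

    test_frame:  (H, W, 3) test image.
    clip_frames: (T, H, W, 3) all frames of the one-second clip.
    Returns up to top_k (row, col) top-left pixel coordinates, sorted from
    the largest difference to the smallest.
    """
    avg = clip_frames.mean(axis=0)
    # Per-pixel difference, averaged over color channels (assumed metric).
    diff = np.abs(test_frame.astype(np.float64) - avg).mean(axis=-1)
    rows, cols = diff.shape[0] // patch, diff.shape[1] // patch
    # Split into a non-overlapping grid of patch x patch tiles.
    tiles = diff[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch)
    score = tiles.mean(axis=(1, 3))                      # (rows, cols)
    order = np.argsort(score, axis=None)[::-1][:top_k]   # largest first
    tile_rows, tile_cols = np.unravel_index(order, score.shape)
    return [(int(r) * patch, int(c) * patch) for r, c in zip(tile_rows, tile_cols)]
```

Restricting the metrics to these patches focuses the evaluation on the regions that actually move during the clip, so a static background does not dominate the score.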