Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling

Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang, Pedro Miraldo, Suhas Lohit, Moitreya Chatterjee

TL;DR

Gear-NeRF addresses key limitations of dynamic NeRFs by embedding semantic information from SAM (the Segment Anything Model) into a 4D spatio-temporal representation and introducing gear-based stratification. This enables motion-aware spatio-temporal sampling that allocates higher resolution to regions with larger motion, improving rendering realism while enabling free-viewpoint tracking from simple prompts at almost no extra cost. The approach combines serial 4D feature volumes with a 4D SAM embedding and an alternating gear-assignment training scheme, achieving state-of-the-art results on dynamic view synthesis and object tracking across multiple challenging datasets. Its semantic guidance and adaptive sampling are relevant to VR/AR, 3D animation, and interactive scene understanding in dynamic environments.

Abstract

Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have enabled their near photo-realistic, free-viewpoint rendering. Although these methods have shown some potential in creating immersive experiences, two drawbacks limit their ubiquity: (i) a significant reduction in reconstruction quality when the computing budget is limited, and (ii) a lack of semantic understanding of the underlying scenes. To address these issues, we introduce Gear-NeRF, which leverages semantic information from powerful image segmentation models. Our approach presents a principled way for learning a spatio-temporal (4D) semantic embedding, based on which we introduce the concept of gears to allow for stratified modeling of dynamic regions of the scene based on the extent of their motion. Such differentiation allows us to adjust the spatio-temporal sampling resolution for each region in proportion to its motion scale, achieving more photo-realistic dynamic novel view synthesis. At the same time, almost for free, our approach enables free-viewpoint tracking of objects of interest - a functionality not yet achieved by existing NeRF-based methods. Empirical studies validate the effectiveness of our method, where we achieve state-of-the-art rendering and tracking performance on multiple challenging datasets.

Paper Structure (18 sections, 12 equations, 9 figures, 10 tables)


Figures (9)

  • Figure 1: (a) Our method takes RGB videos captured from a camera array as input. (b) Trained Gear-NeRF achieves photo-realistic real-time free-viewpoint rendering of a dynamic scene. (c) With users giving a single click at any time and from any viewpoint, our method can perform free-viewpoint tracking of the target object.
  • Figure 2: Pipeline of Gear-NeRF: Gear-NeRF takes multi-view videos as input. After optimizing the serial 4D feature volumes (\ref{sec:representation}), it maps space-time coordinates to a 4D semantic embedding (\ref{sec:sam_embedding}), in addition to the volume density and view-dependent radiance color. Regions with larger motion are automatically assigned higher gear levels (\ref{sec:gear_determine}) and, as a result, receive higher-resolution spatio-temporal sampling (\ref{sec:st_sampling}). Furthermore, Gear-NeRF is capable of performing free-viewpoint tracking of a target object with prompts as simple as a user click (\ref{sec:novel_tracking}).
  • Figure 3: Illustration of Gear Assignment Update: For each gear assignment update, we calculate the rendering loss map between the rendered RGB-SAM map and the ground truth and identify the centers of the patches with the maximum and minimum losses, marked in red and green (second column). These points are then fed into the SAM decoder as positive and negative prompts to generate an upshift mask representing the areas that need to be shifted to a higher gear (last column). After the first gear assignment update, we see that the next candidate region for upshift is situated where the horse is located, and so on. Upshift mask colors imply the gear levels after the update (blue-2, green-3, red-4).
  • Figure 4: Motion-aware Spatial Sampling: We split each sampled point into $2 ^ {p(\mathbf{x}, t)}$ points, equally spaced within the corresponding ray segment. The top row shows the vanilla uniformly sampled points, while the bottom one shows the densely sampled points after splitting using our strategy.
  • Figure 5: Qualitative comparisons for novel view synthesis on the Technicolor dataset \cite{technicolor}: We qualitatively compare our approach against HyperReel \cite{attal2023hyperreel} and Neural 3D Video \cite{li2022neural}. Our approach better recovers fine details like patterns on the toys or stripes on the shirt.
  • ...and 4 more figures
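
The point-splitting rule described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 1D ray parameterized by depth, base samples placed at the midpoints of equal-length segments, and a hypothetical callable `gear_exponent` that stands in for $p(\mathbf{x}, t)$ and is evaluated once per segment. The function and parameter names are illustrative and do not come from the paper.

```python
from typing import Callable, List


def split_ray_samples(
    t_near: float,
    t_far: float,
    n_base: int,
    gear_exponent: Callable[[float], int],
) -> List[float]:
    """Return sample depths along a ray after motion-aware splitting.

    The ray interval [t_near, t_far] is cut into n_base equal segments
    (the "vanilla" uniform sampling). Each segment with exponent p is then
    covered by 2**p equally spaced points instead of one.

    gear_exponent is a hypothetical stand-in for p(x, t): it takes the
    depth of a segment midpoint and returns a non-negative integer.
    """
    segment = (t_far - t_near) / n_base
    depths: List[float] = []
    for i in range(n_base):
        start = t_near + i * segment
        p = gear_exponent(start + 0.5 * segment)  # query at segment midpoint
        count = 2 ** p                            # number of points after the split
        step = segment / count
        # Place the points at the midpoints of `count` equal sub-segments.
        depths.extend(start + (j + 0.5) * step for j in range(count))
    return depths


if __name__ == "__main__":
    # Assumed toy scene: only depths in [1, 2) belong to a moving region.
    def toy_exponent(depth: float) -> int:
        return 1 if 1.0 <= depth < 2.0 else 0

    print(split_ray_samples(0.0, 4.0, 4, lambda d: 0))   # [0.5, 1.5, 2.5, 3.5]
    print(split_ray_samples(0.0, 4.0, 4, toy_exponent))  # [0.5, 1.25, 1.75, 2.5, 3.5]
```

In the toy run, only the segment covering the assumed moving region is refined, so the sample count grows from 4 to 5 rather than doubling along the whole ray. This is the budget argument behind motion-proportional sampling.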