TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes

Christopher Maxey, Jaehoon Choi, Yonghan Lee, Hyungtae Lee, Dinesh Manocha, Heesung Kwon

TL;DR

This paper proposes an extension of the K-Planes Neural Radiance Field (NeRF) in which the algorithm stores a set of tiered, high-dimensional feature vectors. These vectors are generated to model conceptual information about a scene and to be processed by an image decoder that transforms output feature maps into RGB images.

Abstract

In this paper, we present a new approach to bridge the domain gap between synthetic and real-world data for unmanned aerial vehicle (UAV)-based perception. Our formulation is designed for dynamic scenes consisting of small moving objects or human actions. We propose an extension of the K-Planes Neural Radiance Field (NeRF), wherein our algorithm stores a set of tiered feature vectors. The tiered feature vectors are generated to effectively model conceptual information about a scene and to be processed by an image decoder that transforms output feature maps into RGB images. Our technique leverages the information among both static and dynamic objects within a scene and is able to capture salient scene attributes of high-altitude videos. We evaluate its performance on challenging datasets, including Okutama Action and UG2, and observe considerable improvement in accuracy over state-of-the-art neural rendering methods.

Paper Structure

This paper contains 15 sections, 3 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Example neural rendering results. Our target dynamic scene contains small, fast-moving people (in red box) captured using only a monocular camera. Despite this challenging scenario, our proposed method produces high-quality renderings for dynamic scenes.
  • Figure 2: Overview of our tiered planes algorithm: An arbitrary number $n$ of grids is shown, representing different scales of feature space, with larger scales (earlier feature maps) capturing more abstract scene details. Note that the algorithm uses nine grids in total per scale: three static spatial, three dynamic spatial, and three dynamic spatio-temporal. Feature vectors are processed via concurrent MLPs to output a density value and a final feature vector. These are accumulated with volumetric rendering into a feature map. Feature maps from each set of grids are input into corresponding stages of the image decoder.
  • Figure 3: a.) A diagram of a block within the image decoder. The $n^{th}$ block accepts up to two inputs: a feature map from the $n^{th}$ set of grids, if applicable, and the output $O_{n-1}$ from the previous block. At $n=0$, there is no input from a previous block, only the feature map. Beyond the last set of grids, there is no feature map input, only the input from the previous block. b.) The final image decoder block, with $N$ $3\times3$ convolutional layers followed by a final convolutional layer that reduces the channel size to three and a sigmoid activation layer that maps the raw output to RGB values.
  • Figure 4: An example frame from each of the four scenes used to validate our experiments. Each scene from Okutama-Action offers a unique viewpoint, time-of-day and number of people. The scene from UG2 offers a challenging viewpoint from a quadrocopter of a cricket match in session.
  • Figure 5: A comparison of dynamic regions from a subsection of video 1.1.1 in Okutama-Action. a.) Okutama ground truth, b.) stock K-Planes, c.) Extended K-Planes, d.) 4D-Gaussian, e.) TK-Planes. We highlight the improved rendering quality generated by TK-Planes over the other methods on these challenging dynamic scenes.
  • ...and 3 more figures
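
The Figure 2 caption describes per-sample density values and feature vectors being accumulated with volumetric rendering into a feature map. The sketch below illustrates that accumulation for a single ray, assuming the standard NeRF quadrature (alpha compositing with transmittance); the paper's own equations are not reproduced in this excerpt, so the exact weighting is an assumption. The function name `composite_features` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def composite_features(densities, features, deltas):
    """Alpha-composite per-sample feature vectors along one ray.

    Assumes the standard NeRF quadrature, not necessarily the paper's
    exact formulation.

    densities: non-negative density value per sample along the ray
    features:  one feature vector (list of floats) per sample
    deltas:    distance covered by each sample along the ray
    Returns (accumulated feature vector, per-sample weights).
    """
    dim = len(features[0])
    accumulated = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for sigma, feat, delta in zip(densities, features, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        weights.append(weight)
        for c in range(dim):
            accumulated[c] += weight * feat[c]
        transmittance *= 1.0 - alpha
    return accumulated, weights


# Three samples on one ray: empty space, a dense surface, a faint sample behind it.
densities = [0.0, 50.0, 1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
deltas = [0.1, 0.1, 0.1]
pixel_feature, weights = composite_features(densities, features, deltas)
```

In this toy ray the dense middle sample receives almost all of the weight, so the accumulated vector is dominated by its feature. Repeating the accumulation for every pixel's ray would yield one feature map per set of grids, which the caption says is then passed to the corresponding stage of the image decoder.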