CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video

Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Yang Long, Yefeng Zheng

TL;DR

This work tackles high-quality novel-view synthesis from monocular videos of dynamic scenes, where existing NeRF-based methods struggle with complex object motion. It introduces CTNeRF, which integrates a Ray-Based Cross-Time Transformer (RBCT) and a Global Spatio-Temporal Filter (GSTF) to fuse temporal, spatial, and frequency-domain information, while maintaining separate static and dynamic branches for background and foreground. The approach employs multi-view feature aggregation, cross-time attention, ray-wise fusion, and regularization from depth/flow priors, achieving state-of-the-art results on dynamic datasets with improved sharpness and fewer artifacts in dynamic regions. Despite this strong performance, the authors acknowledge limitations with very long sequences and non-rigid deformations, and point to future work on longer-range aggregation and scalable neural representations to further improve efficiency and quality.

Abstract

The goal of our work is to generate high-quality novel views from monocular videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have shown impressive performance by leveraging time-varying dynamic radiance fields. However, these methods struggle to accurately model the motion of complex objects, which can lead to inaccurate and blurry renderings of details. To address this limitation, we propose a novel approach that builds upon a recent generalizable NeRF, which aggregates nearby views onto new viewpoints. Because such methods are typically effective only for static scenes, we introduce a module that operates in both the time and frequency domains to aggregate the features of object motion. This allows us to learn the relationship between frames and generate higher-quality images. Our experiments demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets. Specifically, our approach outperforms existing methods in terms of both the accuracy and visual quality of the synthesized views. Our code is available at https://github.com/xingy038/CTNeRF.

Paper Structure (30 sections, 21 equations, 10 figures, 8 tables, 1 algorithm)

Figures (10)

  • Figure 1: The pipeline of our model. Our model is composed of two main parts, each responsible for handling a different aspect of the input data. One component focuses on the static background, while the other deals with the dynamic foreground. These two sets of values are then blended together to obtain the final novel view.
  • Figure 2: Aggregating feature vectors in an epipolar-aligned manner introduces rendering errors, producing artifacts that degrade the quality of the rendered novel views.
  • Figure 3: The pipeline of the RBCT module. The model consists of two main components: the cross-time transformer on the left and the ray transformer on the right. The left component takes a set of feature vectors from consecutive frames as input and applies cross-time attention to aggregate these vectors with the current frame. The resulting feature vector is then passed to the right component, which uses ray attention to aggregate feature vectors from multiple sampling points along each ray. Finally, a pooling operation is applied to these vectors to obtain the final aggregated feature vector.
  • Figure 4: Network architectures of our static and dynamic representations.
  • Figure 5: Qualitative novel view synthesis results on the Nvidia Dynamic Scene Dataset (Yoon et al., 2020). In contrast to other NeRF-based approaches, our results are sharper and capture finer details that closely approximate the ground truth, particularly in dynamic regions.
  • ...and 5 more figures
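
The data flow described in the Figure 3 caption (cross-time attention, then ray attention, then pooling) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention without learned projections, assumes mean pooling for the unspecified "pooling operation", and all function names (`cross_time_attention`, `ray_attention`, `rbct_sketch`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.
    Assumption: no learned query/key/value projections."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def cross_time_attention(current, neighbors):
    """current: [points][dim] features of the current frame along one ray.
    neighbors: [frames][points][dim] features from consecutive frames.
    Each sample point attends to itself and to the same point index
    in the neighbouring frames."""
    fused = []
    for p, query in enumerate(current):
        context = [query] + [frame[p] for frame in neighbors]
        fused.append(attend(query, context, context))
    return fused


def ray_attention(points):
    """Self-attention across the sample points of a single ray."""
    return [attend(q, points, points) for q in points]


def mean_pool(points):
    """Assumed pooling choice: average over the sample points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def rbct_sketch(current, neighbors):
    """Cross-time attention -> ray attention -> pooling."""
    return mean_pool(ray_attention(cross_time_attention(current, neighbors)))


# Toy input: 3 sample points on one ray, 2-dim features, 2 neighbouring frames.
current = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
neighbors = [
    [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]],
    [[0.8, 0.2], [0.2, 0.8], [0.6, 0.4]],
]
aggregated = rbct_sketch(current, neighbors)
print(len(aggregated))  # one feature vector per ray
```

A real model would add learned projections, multiple heads, and positional or temporal encodings; the sketch only shows the order in which features are fused across time and then along the ray.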