3D Reconstruction from Transient Measurements with Time-Resolved Transformer

Yue Li, Shida Sun, Yu Hong, Feihu Xu, Zhiwei Xiong

TL;DR

This paper introduces the Time-Resolved Transformer (TRT), a transformer-based architecture that exploits local and global spatio-temporal correlations in transient measurements for photon-efficient 3D reconstruction. It defines two attention mechanisms, spatio-temporal self-attention encoders and spatio-temporal cross attention decoders, which produce deep local and global feature representations that are fused to reconstruct LOS and NLOS scenes. The two embodiments, TRT-LOS and TRT-NLOS, demonstrate state-of-the-art performance on synthetic and real-world data. The work also includes a dedicated transient denoiser for NLOS imaging and a large synthetic LOS dataset to support training. The approach generalizes across different imaging systems and sensor noise levels, advancing practical 3D imaging in challenging environments.

Abstract

Transient measurements, captured by time-resolved systems, are widely employed in photon-efficient reconstruction tasks, including line-of-sight (LOS) and non-line-of-sight (NLOS) imaging. However, 3D reconstruction from these measurements remains challenging because of the low quantum efficiency of sensors and high noise levels, particularly for long-range or complex scenes. To boost 3D reconstruction performance in photon-efficient imaging, we propose a generic Time-Resolved Transformer (TRT) architecture. Unlike existing transformers designed for high-dimensional data, TRT has two attention designs tailored to spatio-temporal transient measurements. Specifically, the spatio-temporal self-attention encoders explore both local and global correlations within transient data by splitting or downsampling input features into different scales. The spatio-temporal cross attention decoders then integrate the local and global features in the token space, yielding deep features with high representation capability. Building on TRT, we develop two task-specific embodiments: TRT-LOS for LOS imaging and TRT-NLOS for NLOS imaging. Extensive experiments demonstrate that both embodiments significantly outperform existing methods on synthetic data and on real-world data captured by different imaging systems. In addition, we contribute a large-scale, high-resolution synthetic LOS dataset with various noise levels and capture a set of real-world NLOS measurements using a custom-built imaging system, enhancing the data diversity in this field. Code and datasets are available at https://github.com/Depth2World/TRT.
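
The encoder/decoder idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (which is available at the repository above). It is a minimal illustration under stated assumptions: tokens are a flat list of feature vectors rather than a 3D spatio-temporal volume, there are no learned projections or multiple heads, "splitting" is modeled as non-overlapping windows, "downsampling" is modeled as average pooling, and the cross attention uses local features as queries and global features as keys and values. All function names (`attention`, `local_self_attention`, `global_self_attention`, `cross_attention_fuse`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of token vectors.

    Assumption: no learned Q/K/V projections, single head.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def downsample(tokens, factor):
    """Average every `factor` consecutive tokens into one coarse token."""
    coarse = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        coarse.append([sum(col) / len(group) for col in zip(*group)])
    return coarse


def local_self_attention(tokens, window):
    """Split tokens into windows and attend only within each window."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attention(chunk, chunk, chunk))
    return out


def global_self_attention(tokens, factor):
    """Downsample tokens, then attend across all coarse tokens."""
    coarse = downsample(tokens, factor)
    return attention(coarse, coarse, coarse)


def cross_attention_fuse(local_feats, global_feats):
    """Assumed fusion: local features query the global features."""
    return attention(local_feats, global_feats, global_feats)
```

Calling `local_self_attention(tokens, window=4)` and `global_self_attention(tokens, factor=4)` on the same eight tokens gives eight local features and two global features, and `cross_attention_fuse` returns one fused vector per local token. The point of the two branches is cost: windowed attention keeps fine detail cheaply, while attention over the downsampled tokens captures long-range context.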

Paper Structure

This paper contains 30 sections, 17 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Schematic diagrams of line-of-sight imaging and non-line-of-sight imaging, with an example of the local and global correlations in the transient measurements. In line-of-sight imaging, the scene is oriented towards the system, while in non-line-of-sight imaging the scene faces the relay wall. The points and patches shown in the diagram refer to the regions within the scanning area that correspond to the orientation of the scene.
  • Figure 2: (a) An overview of the proposed time-resolved transformer. The symbols "$\downarrow$" and "$\uparrow$" with a circular block denote the downsampling and upsampling operators along the spatial dimension. The subscripts $S$, $L$, and $G$ represent the shallow, local, and global features, while the superscript $*$ indicates the deep features. (b) An overview of the spatio-temporal self-attention encoder. (c) An overview of the spatio-temporal cross attention decoder. "$\times$" with a circle denotes matrix multiplication.
  • Figure 3: The flowchart of our proposed TRT-LOS. "C" and "D" with rectangular blocks denote the 3D convolution and 3D dilated convolution, respectively, with their kernel sizes given after the letter. "C" with circular blocks denotes concatenation. "DS" and "TDS" denote the downsampling operators along the spatio-temporal and temporal dimensions, respectively. "TPF" with a rectangular block denotes the pixel shuffle used as the upsampling operator along the temporal dimension.
  • Figure 4: Thumbnails of the synthetic test scenes from the Middlebury2014 dataset.
  • Figure 5: Reconstructed results from the simulated test set under different SBR conditions. The odd and even rows are the depth maps and depth error maps, respectively. The last column lists the ground-truth depth map and intensity image. The color bars show the depth and depth error values.
  • ...and 5 more figures