TransiT: Transient Transformer for Non-line-of-sight Videography

Ruiqian Li, Siyuan Shen, Suan Xia, Ziheng Wang, Xingyue Peng, Chengxuan Song, Yingsheng Zhu, Tao Wu, Shiying Li, Jingyi Yu

TL;DR

TransiT tackles the challenge of real-time, high-quality NLOS videography under fast-scan distortions by introducing a Transient Transformer that directly compresses temporal transients and fuses frame-difference features within a spatiotemporal Transformer. A distortion model under fast scanning is paired with a two-stage training regime, including an MMD-based transfer loss, enabling robust reconstruction from sparse 16×16 transients to 64×64 videos at 10 FPS. The approach excels on synthetic and real-measured data, outperforming state-of-the-art methods in reconstruction quality and frame-rate practicality, and demonstrates practical potential for autonomous navigation and post-disaster search. The work also provides a large synthetic dataset and codebase to advance NLOS videography research.
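The MMD-based transfer loss mentioned above can be illustrated with a small sketch. Maximum mean discrepancy (MMD) measures how far apart two feature distributions are, here features from synthetic and from real-measured transients. The summary does not state the kernel, the bandwidth, or the layer at which features are compared, so the Gaussian kernel, the `bandwidth` parameter, and the function names below are assumptions for illustration, not the authors' implementation.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Toy features: hypothetical stand-ins for synthetic and real-measured batches.
synthetic_features = [[0.0, 0.0], [1.0, 0.0]]
real_features_close = [[0.0, 0.0], [1.0, 0.0]]
real_features_far = [[3.0, 3.0], [4.0, 3.0]]

gap_close = mmd_squared(synthetic_features, real_features_close)
gap_far = mmd_squared(synthetic_features, real_features_far)
```

The estimate is zero when the two feature sets coincide and grows as they drift apart, which is why minimizing such a term during training would pull synthetic and real feature distributions together.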

Abstract

High-quality, high-speed videography using Non-Line-of-Sight (NLOS) imaging benefits autonomous navigation, collision prevention, and post-disaster search and rescue tasks. Current solutions must trade off frame rate against image quality. High frame rates, for example, can be achieved by reducing either per-point scanning time or scanning density, but at the cost of lowering the information density of individual frames. Fast scanning further reduces the signal-to-noise ratio, and different scanning systems exhibit different distortion characteristics. In this work, we design and employ a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. It further adopts a feature fusion mechanism and employs a spatial-temporal Transformer to help capture features of NLOS transient videos. Moreover, TransiT applies transfer learning to bridge the gap between synthetic and real-measured data. In real experiments, TransiT reconstructs NLOS videos at a $64 \times 64$ resolution at 10 frames per second from sparse $16 \times 16$ transients measured at an exposure time of 0.4 ms per point. We will make our code and dataset available to the community.
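The frame-rate figure in the abstract follows from simple scan-time arithmetic, sketched below. The calculation assumes the frame time is just the number of scan points times the per-point exposure, ignoring galvanometer settling, data transfer, and reconstruction time; the function names are illustrative. The dense 64×64 grid at 2 ms per point is the distortion-free reference configuration given in the Figure 2 caption.

```python
def frame_time_s(grid_side: int, exposure_ms: float) -> float:
    """Acquisition time in seconds for one frame of a square raster scan.

    Assumes time = (number of points) x (exposure per point), with no overhead.
    """
    return grid_side * grid_side * exposure_ms / 1000.0


def max_fps(grid_side: int, exposure_ms: float) -> float:
    """Upper bound on the frame rate implied by acquisition time alone."""
    return 1.0 / frame_time_s(grid_side, exposure_ms)


# Sparse fast scan from the abstract: 16 x 16 points at 0.4 ms per point.
sparse_time = frame_time_s(16, 0.4)  # 256 points * 0.4 ms, about 0.1024 s
sparse_fps = max_fps(16, 0.4)        # about 9.77 frames per second

# Dense reference scan from the Figure 2 caption: 64 x 64 points at 2 ms per point.
dense_time = frame_time_s(64, 2.0)   # 4096 points * 2 ms, about 8.192 s

print(f"sparse: {sparse_time * 1000:.1f} ms per frame, {sparse_fps:.2f} FPS")
print(f"dense:  {dense_time:.3f} s per frame")
```

Under these assumptions the sparse scan takes roughly 0.1 s per frame, consistent with the reported 10 frames per second, while the dense scan would need about 8 s per frame, roughly 80 times longer.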

Paper Structure

This paper contains 10 sections, 11 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: NLOS videography. An NLOS imaging system captures transients of a moving hidden object (e.g., a walking person) via a relay surface. TransiT is capable of reconstructing high quality NLOS video of a person at 10 FPS using fast and sparse scanning.
  • Figure 2: Distortion model under fast scanning. (a) System configuration for distortions under fast scanning. Because the galvanometer's speed is limited, illumination and detection each trace a line on the relay wall rather than dwelling at a single point. (b) Images reconstructed using f-k [lindell2019fk] from three different input transients. Dense: $64\times 64$ grid with 2 ms per point, distortion-free. Distorted: $16\times 16$ grid with 0.4 ms per point. Simulated Distortion: $16 \times 16$ grid generated by applying our distortion model to the transients of scanning points picked from the Dense grid.
  • Figure 3: Pipeline of TransiT. TransiT is a Transformer-based architecture that takes the transients of the current and previous frames as input. Transient compression extracts features by compressing the input transients along the temporal axis. Feature fusion combines the current frame's features with the difference between the current and previous frame features. ViT blocks with spatial-temporal attention then process the fused features. A final linear layer outputs a high-resolution reconstruction.
  • Figure 4: System setup. (a) Our NLOS imaging system, and (b) The fast scanning pattern.
  • Figure 5: Comparison of synthetic results. From top to bottom: ground truth, ours, f-k, and PnP. The results are reconstructed from multiple frames of noisy 16×16 synthetic data for different objects: a character 'K', a propeller, and a human.
  • ...and 2 more figures
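
The first two stages described in the Figure 3 caption, transient compression and feature fusion, can be sketched in a few lines. The caption does not say how the temporal axis is compressed or how the two feature sets are "combined", so the linear projection (`weights`), the concatenation in `fuse`, and all names below are assumptions for illustration only. The ViT blocks and the final linear layer are omitted.

```python
def compress_temporal(transient, weights):
    """Project each scan point's time histogram onto a few features.

    transient: list of per-point histograms, each of length T (time bins).
    weights:   hypothetical learned projection, D rows of length T.
    Returns one feature vector of length D per scan point.
    """
    return [
        [sum(w * h for w, h in zip(row, histogram)) for row in weights]
        for histogram in transient
    ]


def fuse(current, previous):
    """Concatenate current features with the current-minus-previous difference.

    Concatenation is an assumed choice; the caption only says "combines".
    """
    return [
        cur + [c - p for c, p in zip(cur, prev)]
        for cur, prev in zip(current, previous)
    ]


# Toy input: 2 scan points, T = 4 time bins, D = 2 features per point.
weights = [[1, 1, 0, 0], [0, 0, 1, 1]]         # sums early bins and late bins
previous_frame = [[1, 0, 0, 0], [0, 0, 2, 0]]
current_frame = [[1, 1, 0, 0], [0, 0, 0, 3]]

previous_features = compress_temporal(previous_frame, weights)
current_features = compress_temporal(current_frame, weights)
fused = fuse(current_features, previous_features)
```

In this toy case each scan point ends up with four values: two describing the current frame and two describing how it changed since the previous frame, which is the kind of motion cue the fusion step is described as providing.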