TransiT: Transient Transformer for Non-line-of-sight Videography
Ruiqian Li, Siyuan Shen, Suan Xia, Ziheng Wang, Xingyue Peng, Chengxuan Song, Yingsheng Zhu, Tao Wu, Shiying Li, Jingyi Yu
TL;DR
TransiT tackles the challenge of real-time, high-quality NLOS videography under fast-scan distortions by introducing a Transient Transformer that directly compresses temporal transients and fuses frame-difference features within a spatiotemporal Transformer. A distortion model under fast scanning is paired with a two-stage training regime, including an MMD-based transfer loss, enabling robust reconstruction from sparse 16×16 transients to 64×64 videos at 10 FPS. The approach excels on synthetic and real-measured data, outperforming state-of-the-art methods in reconstruction quality and frame-rate practicality, and demonstrates practical potential for autonomous navigation and post-disaster search. The work also provides a large synthetic dataset and codebase to advance NLOS videography research.
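To make the MMD-based transfer loss more concrete, the following is a minimal NumPy sketch of a squared maximum mean discrepancy (MMD) between a batch of synthetic-data features and a batch of real-data features. It is an illustration only: the Gaussian kernel, the `bandwidth` parameter, the biased estimator, and the function names `rbf_kernel` and `mmd_squared` are assumptions made here, not details taken from TransiT.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(source_feats, target_feats, bandwidth=1.0):
    """Biased estimate of squared MMD between two feature batches.

    source_feats: (n, d) array, e.g. features from synthetic transients.
    target_feats: (m, d) array, e.g. features from real-measured transients.
    The kernel choice and bandwidth are assumptions for illustration.
    """
    k_ss = rbf_kernel(source_feats, source_feats, bandwidth)
    k_tt = rbf_kernel(target_feats, target_feats, bandwidth)
    k_st = rbf_kernel(source_feats, target_feats, bandwidth)
    return float(k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean())
```

A loss of this kind is small when the two feature distributions overlap and grows as they drift apart, which is why it is a common choice for narrowing a synthetic-to-real gap.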
Abstract
High-quality, high-speed videography using Non-Line-of-Sight (NLOS) imaging benefits autonomous navigation, collision prevention, and post-disaster search and rescue tasks. Current solutions must trade off frame rate against image quality. High frame rates, for example, can be achieved by reducing either the per-point scanning time or the scanning density, but at the cost of lowering the information density of individual frames. Fast scanning further reduces the signal-to-noise ratio, and different scanning systems exhibit different distortion characteristics. In this work, we design and employ a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. It further adopts a feature fusion mechanism and employs a spatial-temporal Transformer to help capture features of NLOS transient videos. Moreover, TransiT applies transfer learning to bridge the gap between synthetic and real-measured data. In real experiments, TransiT reconstructs NLOS videos at a $64 \times 64$ resolution and 10 frames per second from sparse $16 \times 16$ transients measured at an exposure time of 0.4 ms per point. We will make our code and dataset available to the community.
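The reported frame rate can be checked against the scan parameters with simple arithmetic. The sketch below assumes that one frame corresponds to one full pass over the scan grid and that frame time is dominated by per-point exposure, ignoring any scanner settling or readout overhead; the function name `scan_frame_rate` is hypothetical.

```python
def scan_frame_rate(grid_side, exposure_s):
    """Return (points per frame, frame time in seconds, frames per second).

    Assumes a square grid_side x grid_side scan in which every point is
    exposed for exposure_s seconds and no other overhead is incurred.
    """
    points = grid_side * grid_side
    frame_time_s = points * exposure_s
    return points, frame_time_s, 1.0 / frame_time_s


# Values quoted in the abstract: a 16 x 16 grid at 0.4 ms per point.
points, frame_time_s, fps = scan_frame_rate(16, 0.4e-3)

# Upsampling factor from the 16 x 16 input grid to the 64 x 64 output.
pixel_ratio = (64 * 64) // (16 * 16)
```

Under these assumptions, 256 scan points at 0.4 ms each take about 102 ms per frame, which is just under 10 frames per second and consistent with the stated rate. Each reconstructed frame has 16 times as many pixels as there are scan points.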
