Modulo Video Recovery via Selective Spatiotemporal Vision Transformer
Tianyu Geng, Feng Ji, Wee Peng Tay
TL;DR
This paper tackles the recovery of HDR-quality video from modulo-captured frames, a setting where traditional HDR methods fail because pixel values are folded discretely. It introduces SSViT, a Selective Spatiotemporal Vision Transformer that iteratively predicts folding masks and unwraps frames, aided by a token selection mechanism based on a Neighboring Similarity Matrix. The approach combines a shared encoder, a spatiotemporal Transformer backbone without positional embeddings, a folding-number decoder, and optical-flow-based completion for unselected tokens, enabling robust long-range spatiotemporal reasoning. Experiments on synthetic and HdM datasets show that SSViT achieves state-of-the-art performance on 8-bit modulo inputs and can reconstruct 12-bit HDR frames, highlighting its potential for HDR video applications and real-world modulo cameras. The work lays a foundation for Transformer-based modulo recovery and points to future improvements in robustness to heavy folding and real-world sensor integration.
Abstract
Conventional image sensors have limited dynamic range, causing saturation in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding incident irradiance into a bounded range, yet require specialized unwrapping algorithms to reconstruct the underlying signal. Unlike HDR recovery, which extends dynamic range from conventional sampling, modulo recovery restores actual values from folded samples. Despite being introduced over a decade ago, progress in modulo image recovery has been slow, especially in the use of modern deep learning techniques. In this work, we demonstrate that standard HDR methods are unsuitable for modulo recovery. Transformers, however, can capture global dependencies and spatial-temporal relationships crucial for resolving folded video frames. Still, adapting existing Transformer architectures for modulo recovery demands novel techniques. To this end, we present Selective Spatiotemporal Vision Transformer (SSViT), the first deep learning framework for modulo video reconstruction. SSViT employs a token selection strategy to improve efficiency and concentrate on the most critical regions. Experiments confirm that SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
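To make the folding and unwrapping relationship concrete, the following is a minimal sketch in plain Python and is not the SSViT implementation. It assumes an 8-bit modulo sensor observing 12-bit scene values, so each pixel stores only the remainder modulo 256 and recovery amounts to finding the per-pixel folding number. All function names are hypothetical, and the ground-truth folding numbers stand in as an oracle for the masks a learned model would have to predict.

```python
# Illustrative sketch only: helper names are hypothetical, and an oracle
# replaces the learned folding-mask predictor.
BIT_DEPTH = 8
PERIOD = 1 << BIT_DEPTH  # 256: the value at which the assumed sensor wraps


def fold(irradiance):
    """Simulate modulo capture: keep only the remainder after wrapping."""
    return [[v % PERIOD for v in row] for row in irradiance]


def folding_numbers(irradiance):
    """Ground-truth number of wraps per pixel (the quantity to be recovered)."""
    return [[v // PERIOD for v in row] for row in irradiance]


def unwrap(folded, k):
    """Invert the folding in one step, given per-pixel folding numbers k."""
    return [[f + PERIOD * n for f, n in zip(frow, krow)]
            for frow, krow in zip(folded, k)]


def unwrap_iteratively(folded, k):
    """Unwrap one period per pass using binary folding masks.

    Each pass marks pixels that still need unwrapping (mask = 1) and adds
    one PERIOD to them. Here the masks come from the oracle k.
    """
    current = [row[:] for row in folded]
    remaining = [row[:] for row in k]
    steps = 0
    while any(n > 0 for row in remaining for n in row):
        mask = [[1 if n > 0 else 0 for n in row] for row in remaining]
        current = [[c + PERIOD * m for c, m in zip(crow, mrow)]
                   for crow, mrow in zip(current, mask)]
        remaining = [[n - m for n, m in zip(nrow, mrow)]
                     for nrow, mrow in zip(remaining, mask)]
        steps += 1
    return current, steps


# A tiny 2x3 "frame" of 12-bit values (0..4095).
scene = [[100, 300, 4095], [255, 256, 1000]]
folded = fold(scene)
k = folding_numbers(scene)
recovered = unwrap(folded, k)
recovered_iter, num_steps = unwrap_iteratively(folded, k)
```

In this toy setting the folded frame alone is ambiguous: a stored value of 44 could correspond to 44, 300, 556, and so on. That ambiguity is why recovery depends on spatial and temporal context rather than on per-pixel values.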
