Modulo Video Recovery via Selective Spatiotemporal Vision Transformer

Tianyu Geng, Feng Ji, Wee Peng Tay

TL;DR

This paper tackles the problem of recovering HDR-quality video from modulo-captured frames, where traditional HDR methods fail because pixel values are folded discretely. It introduces SSViT, a Selective Spatiotemporal Vision Transformer that handles modulo data by iteratively predicting folding masks and unwrapping frames, aided by a token selection mechanism based on a Neighboring Similarity Matrix. The approach combines a shared encoder, a spatiotemporal Transformer backbone without positional embeddings, a folding-number decoder, and optical-flow-based completion for unselected tokens, enabling long-range spatial-temporal reasoning. Experimental results on synthetic and HdM datasets show that SSViT achieves state-of-the-art performance on 8-bit modulo inputs and can reconstruct 12-bit HDR frames, highlighting its potential for high-dynamic-range video applications and real-world modulo cameras. The work lays a foundation for Transformer-based modulo recovery and points to future improvements in robustness to heavy folding and in real-world sensor integration.

Abstract

Conventional image sensors have limited dynamic range, causing saturation in high-dynamic-range (HDR) scenes. Modulo cameras address this by folding incident irradiance into a bounded range, yet require specialized unwrapping algorithms to reconstruct the underlying signal. Unlike HDR recovery, which extends dynamic range from conventional sampling, modulo recovery restores actual values from folded samples. Although modulo cameras were introduced over a decade ago, progress in modulo image recovery has been slow, especially in the use of modern deep learning techniques. In this work, we demonstrate that standard HDR methods are unsuitable for modulo recovery. Transformers, however, can capture global dependencies and spatial-temporal relationships crucial for resolving folded video frames. Still, adapting existing Transformer architectures for modulo recovery demands novel techniques. To this end, we present Selective Spatiotemporal Vision Transformer (SSViT), the first deep learning framework for modulo video reconstruction. SSViT employs a token selection strategy to improve efficiency and concentrate on the most critical regions. Experiments confirm that SSViT produces high-quality reconstructions from 8-bit folded videos and achieves state-of-the-art performance in modulo video recovery.
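
To make the folding and unwrapping relationship concrete, the following minimal sketch simulates an 8-bit modulo capture of 12-bit values and then inverts it using the per-pixel folding numbers. The function names (`fold`, `folding_mask`, `unwrap`) and the sample values are hypothetical and chosen for illustration; in the actual method the folding mask is predicted by the network rather than computed from ground truth.

```python
# Illustrative only: bit depths follow the text (12-bit scene, 8-bit modulo
# sensor); all function names are hypothetical, not taken from the paper.

MODULUS = 2 ** 8  # an 8-bit modulo sensor wraps around at 256


def fold(frame, modulus=MODULUS):
    """Simulate modulo capture: keep only the remainder of each pixel value."""
    return [[v % modulus for v in row] for row in frame]


def folding_mask(frame, modulus=MODULUS):
    """Ground-truth folding numbers: how many times each pixel wrapped."""
    return [[v // modulus for v in row] for row in frame]


def unwrap(folded, mask, modulus=MODULUS):
    """Recover original values from folded pixels and folding numbers."""
    return [
        [f + k * modulus for f, k in zip(folded_row, mask_row)]
        for folded_row, mask_row in zip(folded, mask)
    ]


hdr = [[100, 300, 4095], [255, 256, 1024]]  # 12-bit values in [0, 4095]
folded = fold(hdr)          # [[100, 44, 255], [255, 0, 0]]
mask = folding_mask(hdr)    # [[0, 1, 15], [0, 1, 4]]
assert unwrap(folded, mask) == hdr
```

Note that the folded values alone are ambiguous (both 256 and 1024 fold to 0), which is why recovery reduces to estimating the integer folding number at every pixel.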

Paper Structure

This paper contains 13 sections, 18 equations, 5 figures, 1 table, and 1 algorithm.

Figures (5)

  • Figure 1: Comparison of two approaches for obtaining HDR images: modulo recovery with modulo-sampled inputs and traditional HDR recovery with single or multi-exposure sampled inputs.
  • Figure 2: Comparison between traditional HDR recovery [tang2023high] and modulo recovery applied to a modulo input. Note that folded pixels (e.g., on the road surface, rooftops, and walls) in the modulo input image are not recovered by the traditional HDR imaging method.
  • Figure 3: Illustration of token selection. The transparent yellow tubes represent the intricate areas selected by the token selection algorithm, where the folding number varies significantly and undergoes noticeable changes across nearby frames. Note that the segmentation of patches and the number of tokens selected in the illustration are for demonstration purposes only. The actual configurations may vary based on experimental requirements.
  • Figure 4: The flowchart of the proposed Selective Spatiotemporal Vision Transformer. First, an encoder converts all frames of the input video clip into embedding features. The token selection strategy then uses these features to determine the selected areas in the clip. Next, the spatiotemporal Transformer models temporal relationships among the selected tokens, while unselected tokens are handled by optical flow estimation and mask warping. Finally, the output of the spatiotemporal Transformer is passed to a decoder that predicts the folding mask.
  • Figure 5: Results on the HdM dataset. Quantitative evaluations using PSNR (dB) / SSIM are displayed below each image.
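
The token selection idea described in the Figure 3 caption can be sketched roughly as follows. This page does not give the definition of the Neighboring Similarity Matrix, so the scoring rule below (mean cosine similarity between a token and its immediate spatial and temporal neighbors, keeping the least similar tokens) is an assumption made for illustration, and every name in the code is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def neighbor_similarity(tokens):
    """tokens[t][i][j] is an embedding; return mean neighbor similarity per token.

    Assumed neighborhood: the 4 spatial neighbors in the same frame plus the
    same position in the previous and next frames.
    """
    T, H, W = len(tokens), len(tokens[0]), len(tokens[0][0])
    offsets = [(0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1), (-1, 0, 0), (1, 0, 0)]
    scores = {}
    for t in range(T):
        for i in range(H):
            for j in range(W):
                sims = [
                    cosine(tokens[t][i][j], tokens[t + dt][i + di][j + dj])
                    for dt, di, dj in offsets
                    if 0 <= t + dt < T and 0 <= i + di < H and 0 <= j + dj < W
                ]
                scores[(t, i, j)] = sum(sims) / len(sims) if sims else 1.0
    return scores


def select_tokens(tokens, k):
    """Pick the k tokens that differ most from their neighbors."""
    scores = neighbor_similarity(tokens)
    return sorted(scores, key=scores.get)[:k]


# Two frames of a 2x2 token grid; one token differs from all of its neighbors.
same, odd = [1.0, 0.0], [0.0, 1.0]
clip = [[[same, odd], [same, same]], [[same, same], [same, same]]]
selected = select_tokens(clip, 1)
```

In this toy clip, the token at frame 0, row 0, column 1 is orthogonal to all of its neighbors, so it is the one selected; in the pipeline described above, such tokens would go to the spatiotemporal Transformer while the rest are completed through optical flow and mask warping.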