A Spatio-temporal Aligned SUNet Model for Low-light Video Enhancement
Ruirui Lin, Nantheera Anantrasirichai, Alexandra Malyugina, David Bull
TL;DR
The paper tackles the enhancement of videos captured in low-light conditions by introducing STA-SUNet, a lightweight model that combines spatio-temporal feature alignment with a Swin Transformer–based SUNet backbone. It aligns multiple neighboring frames using a 3-level deformable convolution scheme and then reconstructs the enhanced frame with a SUNet that uses windowed and shifted-window self-attention. Trained on the fully registered BVI dataset and evaluated on three datasets, STA-SUNet achieves the highest average PSNR and SSIM among the compared models and performs particularly well in extreme low-light scenarios. The approach improves temporal consistency while preserving detail, offering practical benefits for surveillance, autonomous systems, and video analytics in challenging lighting. The work also demonstrates the importance of fully registered data for reliable training and evaluation of video restoration models, and it highlights the trade-off between the number of input frames and computational cost.
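To make the phrase "windowed and shifted-window self-attention" concrete, the following is a minimal sketch of how a Swin-style layer groups positions before attention is applied. It is not the authors' implementation: the function names `cyclic_shift` and `window_partition` are hypothetical, the feature map is a plain 2D list, and the attention computation and the masking of wrapped-around positions are omitted.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and left by `shift` positions, wrapping around.

    Illustrative helper (assumed name). Swin-style layers typically use
    shift = window // 2 in every second block.
    """
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)]
            for i in range(h)]


def window_partition(grid, window):
    """Split a 2D grid into non-overlapping window x window blocks.

    Self-attention would then be computed inside each block independently.
    """
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be a multiple of the window size")
    return [[row[left:left + window] for row in grid[top:top + window]]
            for top in range(0, h, window)
            for left in range(0, w, window)]


# A 4x4 map whose entries are their own flat index, for easy inspection.
grid = [[4 * i + j for j in range(4)] for i in range(4)]
regular = window_partition(grid, 2)                    # plain windows
shifted = window_partition(cyclic_shift(grid, 1), 2)   # shifted windows
```

In the regular layout each window only sees its own block. After the shift, positions that previously sat in different blocks share a window, which is how information can propagate across window boundaries from one layer to the next.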
Abstract
Distortions caused by low-light conditions are not only visually unpleasant but also degrade the performance of computer vision tasks. Restoring and enhancing such content has proven to be highly beneficial. However, only a limited number of enhancement methods are explicitly designed for videos acquired in low-light conditions. We propose a Spatio-Temporal Aligned SUNet (STA-SUNet) model that uses a Swin Transformer as a backbone to capture low-light video features and exploit their spatio-temporal correlations. The STA-SUNet model is trained on a novel, fully registered dataset (BVI), which comprises dynamic scenes captured under varying light conditions. It is then compared against several other models on three test datasets. The model demonstrates superior adaptivity across all datasets, obtaining the highest PSNR and SSIM values. It is particularly effective in extreme low-light conditions, yielding fairly good visual results.
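For readers unfamiliar with PSNR, one of the two metrics reported above, the sketch below shows the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is an illustration only: the function names `psnr` and `video_psnr` are hypothetical, frames are assumed to be flat sequences of pixel values in [0, 1], and averaging per-frame scores is one common convention for video, not necessarily the protocol used in the paper.

```python
import math


def psnr(reference, enhanced, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two frames.

    Frames are flat sequences of pixel values (assumed range [0, max_value]).
    """
    if len(reference) != len(enhanced) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def video_psnr(reference_frames, enhanced_frames):
    """Mean of per-frame PSNR values (an assumed averaging convention)."""
    if len(reference_frames) != len(enhanced_frames) or not reference_frames:
        raise ValueError("videos must be non-empty and have equal frame counts")
    scores = [psnr(r, e) for r, e in zip(reference_frames, enhanced_frames)]
    return sum(scores) / len(scores)
```

A higher value means the enhanced frame is closer to the reference. This is why a fully registered dataset matters: the metric compares pixels at the same positions, so misaligned ground truth would penalise a correct enhancement.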
