A Spatio-temporal Aligned SUNet Model for Low-light Video Enhancement

Ruirui Lin, Nantheera Anantrasirichai, Alexandra Malyugina, David Bull

TL;DR

The paper tackles the challenge of enhancing videos captured in low-light conditions by introducing STA-SUNet, a lightweight model that fuses spatio-temporal feature alignment with a Swin Transformer–based SUNet backbone. It aligns multiple neighboring frames using a 3-level deformable convolution scheme and then reconstructs enhanced frames with an advanced SUNet that leverages windowed and shifted-window self-attention. Trained on a fully registered BVI dataset and evaluated across three datasets, STA-SUNet achieves superior average PSNR and SSIM, particularly excelling in extreme low-light scenarios. The approach improves temporal consistency while preserving detail, offering practical benefits for surveillance, autonomous systems, and video analytics in challenging lighting. The work also demonstrates the importance of fully registered data for reliable training and evaluation of video restoration models, and it highlights the trade-offs between the number of input frames and computational requirements.

Abstract

Distortions caused by low-light conditions are not only visually unpleasant but also degrade the performance of computer vision tasks. Restoring and enhancing such content has proven highly beneficial. However, only a limited number of enhancement methods are explicitly designed for videos acquired in low-light conditions. We propose a Spatio-Temporal Aligned SUNet (STA-SUNet) model that uses a Swin Transformer as its backbone to capture low-light video features and exploit their spatio-temporal correlations. The STA-SUNet model is trained on a novel, fully registered dataset (BVI), which comprises dynamic scenes captured under varying light conditions. It is then compared against various other models on three test datasets. The model demonstrates superior adaptivity across all datasets, obtaining the highest PSNR and SSIM values. It is particularly effective in extreme low-light conditions, yielding fairly good visualisation results.
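
For readers unfamiliar with the PSNR metric cited above, the following is a minimal sketch of how it is commonly computed between an enhanced frame and its normal-light reference. The function names `psnr` and `mean_psnr`, the flat-list frame representation, and the 8-bit peak value of 255 are illustrative assumptions; the paper's own evaluation code and averaging protocol are not given here and may differ.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel intensities. max_value is the largest
    possible intensity (255 for 8-bit data, which is an assumption here).
    """
    if len(reference) != len(enhanced) or len(reference) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(reference_frames, enhanced_frames, max_value=255.0):
    """Average of per-frame PSNR over a video (hypothetical helper)."""
    scores = [
        psnr(ref, enh, max_value)
        for ref, enh in zip(reference_frames, enhanced_frames)
    ]
    return sum(scores) / len(scores)
```

A higher PSNR means the enhanced frame is closer to the reference in a mean-squared-error sense. This pixel-wise comparison is only meaningful when the low-light and normal-light frames are spatially registered, which is why a fully registered dataset matters for both training and evaluation.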

Paper Structure (15 sections, 2 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Distortions in cropped images of the 'Faces2' sequence. (Left) Normal light. (Middle) Enhanced low light (10% brightness) using histogram matching to the normal light to visualise distortions under low-light conditions. (Right) Normal light plus Gaussian noise.
  • Figure 3: Proposed STA-SUNet framework
  • Figure 4: Low-light data example. From top to bottom: light levels of 10%, 20%, and 100% (normal light). From left to right: soft toys and books with faces.
  • Figure 5: Low-light enhancement results with 5-frame inputs, shown on cropped images of frame 102 in the 'Figures2' sequence from the BVI dataset. (Top) From left to right: 10% light level input, enhanced results, and 100% normal-light ground truth. (Bottom) From left to right: 20% light level input, enhanced results, and 100% normal-light ground truth.
  • Figure 6: Visual comparison between STA-SUNet and SDSD-net. From left to right: SDSD-net result, STA-SUNet result, and normal light (100%).