OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition

Shihao Cheng, Jinlu Zhang, Yue Liu, Zhigang Tu

TL;DR

OwlSight addresses the challenge of recognizing actions in dark videos by integrating illumination information throughout the training process with three modules: Time-Consistency Module (TCM), Luminance Adaptation Module (LAM), and Reflect Augmentation Module (RAM). The approach, trained end-to-end, dynamically adapts brightness, preserves temporal coherence, and leverages dual illumination pathways to maximize light usage. A large-scale Dark-101 dataset is introduced to support robust learning in diverse, very low-light scenarios, and OwlSight demonstrates state-of-the-art results across four benchmarks, notably outperforming prior methods on ARID1.5 and Dark-101. The work highlights the practical impact of holistic illumination-aware learning for real-world dark-environment video analysis, with strong gains from temporal consistency and adaptive illumination mechanisms.

Abstract

Human action recognition in low-light environments is crucial for various real-world applications. However, existing approaches do not fully exploit brightness information throughout the training phase, leading to suboptimal performance. To address this limitation, we propose OwlSight, a biologically inspired framework in which whole-stage illumination enhancement interacts with action classification for accurate dark video human action recognition. Specifically, OwlSight incorporates a Time-Consistency Module (TCM) that captures shallow spatiotemporal features while maintaining temporal coherence; these features are then processed by a Luminance Adaptation Module (LAM) that dynamically adjusts brightness based on the input luminance distribution. Furthermore, a Reflect Augmentation Module (RAM) maximizes illumination utilization and simultaneously enhances action recognition via two interactive paths. Additionally, we build Dark-101, a large-scale dataset comprising 18,310 dark videos across 101 action categories, significantly surpassing existing datasets (e.g., ARID1.5 and Dark-48) in scale and diversity. Extensive experiments demonstrate that OwlSight achieves state-of-the-art performance across four low-light action recognition benchmarks. Notably, it outperforms the previous best approaches by 5.36% on ARID1.5 and 1.72% on Dark-101, highlighting its effectiveness in challenging dark environments.
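
To make the idea of adjusting brightness "based on the input luminance distribution" concrete, here is a minimal sketch of a classical, non-learned stand-in: gamma correction whose exponent is chosen from the mean luminance of a frame. This is not the authors' LAM, which the abstract describes only at a high level. The function names, the `target` value of 0.5, and the use of Rec. 601 luma weights are assumptions made for illustration.

```python
import math

def mean_luminance(frame):
    """Mean Rec. 601 luma of a frame given as rows of (r, g, b) floats in [0, 1]."""
    total, count = 0.0, 0
    for row in frame:
        for r, g, b in row:
            total += 0.299 * r + 0.587 * g + 0.114 * b
            count += 1
    return total / count

def adaptive_gamma(frame, target=0.5, eps=1e-6):
    """Gamma-correct a frame so its mean luminance moves toward `target`.

    Hypothetical helper: gamma is chosen so that mean ** gamma == target,
    which gives gamma < 1 (brighten) for dark frames and gamma > 1 (darken)
    for bright ones. `target` must lie strictly between 0 and 1.
    """
    mean = min(max(mean_luminance(frame), eps), 1.0 - eps)
    gamma = math.log(target) / math.log(mean)
    corrected = [[tuple(c ** gamma for c in px) for px in row] for row in frame]
    return corrected, gamma

# Usage: a very dark 2x2 frame is brightened, a bright one is darkened.
dark = [[(0.05, 0.05, 0.05)] * 2 for _ in range(2)]
bright = [[(0.8, 0.8, 0.8)] * 2 for _ in range(2)]
dark_out, dark_gamma = adaptive_gamma(dark)
bright_out, bright_gamma = adaptive_gamma(bright)
```

A learned module would predict its adjustment from features rather than from a single mean, but the sketch shows the basic contract: the same function brightens dark input and attenuates bright input, which mirrors the pupil analogy the paper draws on.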

Paper Structure

This paper contains 25 sections, 16 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Left: Overview of OwlSight. In contrast to previous methods, we place a strong emphasis on fully exploring and utilizing illumination information interactively throughout all stages to boost the performance of dark video action recognition. Right: Comparison of accuracy (Top-1) and efficiency (number of parameters) with state-of-the-art methods on the ARID1.5 dataset.
  • Figure 2: Left: Nocturnal animals dilate their pupils at night to capture more light and constrict them during the day to avoid overexposure. Right: The tapetum lucidum, a reflective retinal layer, reflects light to improve illumination and visual clarity.
  • Figure 3: The overall architecture of OwlSight. OwlSight consists of three key components: the Time-Consistency Module (TCM), the Luminance Adaptation Module (LAM), and the Reflect Augmentation Module (RAM). The TCM extracts shallow spatiotemporal features to preserve temporal coherence across the sequence. The LAM adaptively adjusts global brightness based on the input luminance distribution. The RAM further enhances the available illumination while improving action recognition through two interactive pathways. The Dark-Enhanced (D-E) consistency of the video frames, which is beneficial for video action recognition in low-light environments, can be preserved by comparing the RGB-Difference before and after dark video enhancement.
  • Figure 4: Architecture of the Time-Consistency Module.
  • Figure 5: Comparison of the RGB-Difference of OwlSight with and without the temporal constraint (TCM and $L_{TC}$) on the ARID1.5 dataset. The upper part shows the RGB-Difference of OwlSight without the temporal constraint, which fails to improve action recognition performance in dark videos. The lower part shows the RGB-Difference of OwlSight with the temporal constraint. The introduced temporal constraint strategy significantly boosts action recognition performance in dark videos.
  • ...and 3 more figures
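
The captions for Figures 3 and 5 refer to comparing the RGB-Difference before and after dark video enhancement. The sketch below illustrates one way such a comparison could be computed. It is an assumption-laden illustration, not the paper's $L_{TC}$, whose exact form is not given in this excerpt. The helper names, the peak normalization, and the mean-absolute-error measure are all hypothetical choices.

```python
def rgb_difference(video):
    """Differences between consecutive frames.

    `video` is a list of frames, each a flat list of pixel values.
    """
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(video, video[1:])]

def normalize(diffs, eps=1e-8):
    """Scale differences by their peak magnitude so that a uniform
    brightness gain does not count as a temporal change (assumption)."""
    peak = max(abs(v) for d in diffs for v in d)
    return [[v / (peak + eps) for v in d] for d in diffs]

def consistency_gap(dark, enhanced):
    """Mean absolute gap between the normalized RGB-Differences of the
    dark clip and its enhanced version. Lower means motion is preserved."""
    d = normalize(rgb_difference(dark))
    e = normalize(rgb_difference(enhanced))
    n = sum(len(frame) for frame in d)
    return sum(abs(x - y)
               for fd, fe in zip(d, e)
               for x, y in zip(fd, fe)) / n

# Usage: three frames of two pixels each.
dark = [[0.0, 0.1], [0.1, 0.1], [0.1, 0.2]]
steady = [[4 * v for v in frame] for frame in dark]   # uniform 4x gain
flicker = [[0.0, 0.4], [0.9, 0.4], [0.4, 0.8]]        # one pixel flickers
gap_steady = consistency_gap(dark, steady)
gap_flicker = consistency_gap(dark, flicker)
```

In this toy case a uniform gain leaves the normalized differences unchanged, so the gap is close to zero, while frame-to-frame flicker introduced by enhancement produces a clearly larger gap. A measure of this kind is the sort of signal a temporal constraint could penalize during training.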