Dark Transformer: A Video Transformer for Action Recognition in the Dark

Anwaar Ulhaq

TL;DR

Dark Transformer tackles action recognition in low-light environments by learning cross-domain knowledge with a domain-invariant video transformer. The model has three weight-sharing branches and is trained on paired day–night data $(X_1,X_2)$. It extends TimeSformer with space–time self-attention, space–time cross-attention, and knowledge distillation to align the two domain distributions while preserving discriminative features. The approach achieves state-of-the-art results on InfAR, XD145, and ARID, outperforming CNN-based and two-stream architectures, and ablations show that space–time attention is crucial. This work enables robust action recognition in adverse lighting, with practical implications for visual surveillance and nighttime autonomous systems.
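To make the three-branch idea concrete, here is a minimal NumPy sketch of how self-attention and cross-domain attention could share one set of projection weights. It is an illustration under stated assumptions, not the paper's implementation: the single-head formulation, the token count, the embedding size, the choice of day tokens as queries in the cross branch, and every name (`attention`, `x_day`, `x_night`, `w_q`, `w_k`, `w_v`) are hypothetical, and the knowledge-distillation term is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_tokens, kv_tokens, w_q, w_k, w_v):
    # Single-head scaled dot-product attention.
    # Self-attention when q_tokens is kv_tokens; cross-attention otherwise.
    q = q_tokens @ w_q
    k = kv_tokens @ w_k
    v = kv_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16  # hypothetical: 8 space-time patch tokens, 16-dim embeddings

# Stand-ins for embedded patch tokens of a paired day (X_1) and night (X_2) clip.
x_day = rng.normal(size=(n_tokens, dim))
x_night = rng.normal(size=(n_tokens, dim))

# One set of projection weights, reused by all three branches (weight sharing).
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

day_branch = attention(x_day, x_day, w_q, w_k, w_v)        # self-attention, day
night_branch = attention(x_night, x_night, w_q, w_k, w_v)  # self-attention, night
# Assumed direction: day tokens query night tokens in the middle (cross) branch.
cross_branch = attention(x_day, x_night, w_q, w_k, w_v)
```

Because the three calls reuse the same `w_q`, `w_k`, and `w_v`, any gradient from the cross branch would also update the weights used on night-only input, which is one plausible reading of how the alignment carries over to single-modality inference.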

Abstract

Recognizing human actions in adverse lighting conditions presents significant challenges in computer vision, with wide-ranging applications in visual surveillance and nighttime driving. Existing methods tackle action recognition and dark enhancement separately, limiting the potential for end-to-end learning of spatiotemporal representations for video action classification. This paper introduces Dark Transformer, a novel video transformer-based approach for action recognition in low-light environments. Dark Transformer applies spatiotemporal self-attention in a cross-domain setting to improve action recognition across domains. By extending video transformers to learn cross-domain knowledge, Dark Transformer achieves state-of-the-art performance on benchmark action recognition datasets, including InfAR, XD145, and ARID. The proposed approach demonstrates significant promise in addressing the challenges of action recognition in adverse lighting conditions, offering practical implications for real-world applications.

Paper Structure

This paper contains 11 sections, 7 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Challenges in Automated Action Recognition: Capturing Examples of Actions (Push, Walk, Handshaking, Drink) in the Dark from RGB and Infrared Domains
  • Figure 2: Architectural diagram of the proposed Dark Transformer during training. The model takes source-domain (daytime) and target-domain (nighttime) videos as input. During inference, however, only the target-domain videos are used, as a single modality, for action recognition.
  • Figure 3: The architectural diagram showcases the three weight-sharing branches of the Dark Transformer. The middle branch, highlighted in this diagram, plays a crucial role by providing accurate alignment through space-time cross-domain attention.
  • Figure 4: Examples of visible and infrared actions. Each subfigure displays a visible image (left) from the XD145 dataset's video sequences and an infrared image (right) from the InfAR dataset's video sequences.
  • Figure 5: Sample frames illustrating each of the 11 action classes from the ARID dataset.
  • ...and 3 more figures