Seeing in the Dark: A Teacher-Student Framework for Dark Video Action Recognition via Knowledge Distillation and Contrastive Learning

Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C. -W. Phan

TL;DR

ActLumos tackles dark video action recognition by pairing a dual-stream teacher with a lightweight single-stream student. The teacher fuses dark and retinex-enhanced frames via Dynamic Feature Fusion and is trained with a supervised contrastive objective; the student is trained through self-supervised pretraining and knowledge distillation. The teacher transfers its multi-stream reasoning and sharp class margins to the student, enabling state-of-the-art single-stream inference on ARID V1.0, ARID V1.5, and Dark48 while keeping inference efficient enough for real-world deployment. This work demonstrates the value of segment-wise fusion and lighting-robust representations for practical low-light video understanding.

Abstract

Action recognition in dark or low-light (under-exposed) videos is challenging because visibility degradation can obscure critical spatiotemporal details. This paper proposes ActLumos, a teacher-student framework that attains single-stream inference while retaining multi-stream-level accuracy. The teacher consumes dual-stream inputs, namely the original dark frames and retinex-enhanced frames, which are processed by weight-shared R(2+1)D-34 backbones and fused by a Dynamic Feature Fusion (DFF) module. DFF dynamically re-weights the two streams at each time step, emphasising the most informative temporal segments. The teacher is also trained with a supervised contrastive loss (SupCon) that sharpens class margins. The student shares the R(2+1)D-34 backbone but uses only dark frames and no fusion at test time. The student is first pre-trained with self-supervision on unlabeled dark clips from both ARID and Dark48 and then fine-tuned with knowledge distillation from the teacher, transferring the teacher's multi-stream knowledge into a single-stream model. Under single-stream inference, the distilled student attains state-of-the-art Top-1 accuracy of 96.92% on ARID V1.0, 88.27% on ARID V1.5, and 48.96% on Dark48. Ablation studies further highlight the individual contribution of each component: DFF in the teacher outperforms single-stream or static fusion, knowledge distillation (KD) transfers these gains to the single-stream student, and two-view spatio-temporal SSL surpasses spatial-only or temporal-only variants without increasing inference cost. The official repository for this work is available at: https://github.com/HrishavBakulBarua/ActLumos
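
The per-time-step re-weighting attributed to DFF can be illustrated with a softmax gate over the two streams. This is a minimal sketch, not the authors' implementation: the linear scoring vectors `gate_dark` and `gate_retinex`, the function names, and the toy features are all assumptions, since the abstract does not say how the weights are computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_feature_fusion(dark, retinex, gate_dark, gate_retinex):
    """Fuse two feature sequences with one pair of weights per time step.

    dark, retinex: lists of T feature vectors (each a list of C floats).
    gate_dark, gate_retinex: hypothetical scoring vectors of length C
        (in a real model these would be learned; here they are assumptions).
    Returns (fused, weights): fused is T x C, weights is T x 2.
    """
    fused, weights = [], []
    for d_t, r_t in zip(dark, retinex):
        # Score each stream at this time step with a simple dot product.
        score_d = sum(g * x for g, x in zip(gate_dark, d_t))
        score_r = sum(g * x for g, x in zip(gate_retinex, r_t))
        # Normalise the two scores so the stream weights sum to 1.
        w_d, w_r = softmax([score_d, score_r])
        fused.append([w_d * d + w_r * r for d, r in zip(d_t, r_t)])
        weights.append([w_d, w_r])
    return fused, weights


# Toy features: T = 3 time steps, C = 2 channels.
dark = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
retinex = [[0.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
gate_dark = [1.0, 1.0]
gate_retinex = [1.0, 1.0]

fused, weights = dynamic_feature_fusion(dark, retinex, gate_dark, gate_retinex)
for t, (w_d, w_r) in enumerate(weights):
    print(f"t={t}: dark weight={w_d:.3f}, retinex weight={w_r:.3f}")
```

In this toy run the dark stream receives the larger weight at the first step, the retinex stream at the second, and both are weighted equally at the third, which is the behaviour that distinguishes a dynamic gate from a static fusion using one fixed weight for the whole clip.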

Paper Structure

This paper contains 24 sections, 13 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Self-supervised vs. supervised contrastive learning. Left (self-supervised): the anchor clip (class Pick) has only its own augmented view as a positive (pink edge); all other clips in the batch are treated as negatives (green). Right (supervised): with labels, every clip from the same class Pick, including dark and retinex views of different instances, is a positive (pink), while clips from other classes are negatives (green).
  • Figure 2: Examples of dark frames (top), their retinex-enhanced counterparts (middle), and gamma-corrected frames (bottom) across actions (pour, pick, walk, stand, drink). The dark frames are from the ARID dataset.
  • Figure 3: The framework for the proposed ActLumos approach.
  • Figure 4: Illustration of the proposed Dynamic Feature Fusion (DFF) module. At each temporal step, it adaptively weighs the dark and retinex features to select the most informative representation.
  • Figure 5: Effect of unlabeled SSL pretraining source on downstream Top-1 accuracy for ARID V1.0, ARID V1.5, and Dark48. Each group compares SSL on ARID-only, Dark48-only, and ARID+Dark48 (combined). Red dashed lines denote the KD-only (no SSL) baseline for that dataset. Numbers above bars show absolute accuracy, with the improvement over KD-only shown at the bottom (near the x-axis) of the chart. Combined pretraining is best across all datasets, and in-domain SSL consistently outperforms cross-domain SSL.
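
The difference between the two positive sets described in the Figure 1 caption can be made concrete with a small mask computation. This is an illustrative sketch only: the function names, the toy batch, and the `instance_ids` bookkeeping are assumptions, and the code builds the positive/negative assignment rather than the full contrastive loss.

```python
def positive_mask(labels, instance_ids, supervised):
    """Return mask[i][j] = True when sample j counts as a positive for anchor i.

    labels: class label of each clip view in the batch.
    instance_ids: id of the source clip each view came from (hypothetical
        bookkeeping; two views of one clip share an id).
    supervised: if True, positives share the anchor's class label;
        if False, positives are only other views of the same clip.
    """
    n = len(labels)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # an anchor is never its own positive
            if supervised:
                mask[i][j] = labels[i] == labels[j]
            else:
                mask[i][j] = instance_ids[i] == instance_ids[j]
    return mask


def positives_of(mask, anchor):
    """Indices of the positives for one anchor."""
    return [j for j, is_pos in enumerate(mask[anchor]) if is_pos]


# Toy batch: two views (e.g. dark and retinex) for each of three clips.
labels = ["Pick", "Pick", "Pick", "Pick", "Walk", "Walk"]
instance_ids = [0, 0, 1, 1, 2, 2]

ssl_mask = positive_mask(labels, instance_ids, supervised=False)
sup_mask = positive_mask(labels, instance_ids, supervised=True)

print("self-supervised positives for anchor 0:", positives_of(ssl_mask, 0))
print("supervised positives for anchor 0:", positives_of(sup_mask, 0))
```

With the anchor at index 0 (class Pick), the self-supervised mask marks only the anchor's other view as a positive, while the supervised mask also marks both views of the second Pick clip; the Walk clips are negatives in both cases.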