Stable Mean Teacher for Semi-supervised Video Action Detection

Akash Kumar, Sirshapan Mitra, Yogesh Singh Rawat

TL;DR

This work tackles the challenge of semi-supervised video action detection by extending the Mean Teacher paradigm with two novel components. The Stable Mean Teacher framework introduces an Error Recovery (EoR) module that learns from the student’s mistakes on labeled data and refines the teacher’s pseudo-labels, and a Difference of Pixels (DoP) constraint that enforces temporal coherence in spatio-temporal predictions. The approach yields substantial gains over supervised baselines across four benchmarks (UCF101-24, JHMDB21, AVA, YouTube-VOS), including strong performance in low-label regimes and demonstrated generalization to video object segmentation. The combination of EMA-based teacher updates, class-agnostic error refinement, and temporal consistency constraints produces high-quality pseudo-labels and robust action localization in challenging video data, with public code and models provided for reproducibility.

Abstract

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel Error Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To address this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, leading to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. Code and models are publicly available.
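The EMA-based teacher update mentioned in the TL;DR can be illustrated with a minimal, framework-free sketch. The function name `ema_update`, the dict-of-floats parameter representation, and the decay value are illustrative assumptions and are not taken from the paper's released code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an exponential moving average.

    Hypothetical helper: each parameter becomes
    decay * teacher + (1 - decay) * student.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy parameters (assumed values, for illustration only).
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the teacher changes slowly and averages out noise in the student's recent updates, which is the usual motivation for generating pseudo labels from the teacher rather than the student.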

Paper Structure

This paper contains 28 sections, 7 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Performance overview: Stable Mean Teacher with 10% labels (UCF101-24; left two plots) and 20% labels (JHMDB-21; right two plots) performs comparably to a fully supervised approach trained on 100% of the annotations. It consistently outperforms the existing state of the art (Kumar_2022_CVPR) and the supervised baseline on both f-mAP and v-mAP by a good margin on UCF101-24 and JHMDB-21 at every labeled-set percentage. The x-axis of each plot shows the annotation percentage.
  • Figure 2: Overview of Stable Mean Teacher. Two key components improve the quality of the spatio-temporal pseudo labels: 1) Error Recovery, which refines the spatial action boundary, and 2) the DoP constraint, which induces temporal coherency in the predicted spatio-temporal pseudo labels.
  • Figure 3: Visualization of Difference of Pixels (DoP). The first row shows the RGB frames; the second row shows the pixel difference map of the ground truth along the temporal dimension. We show two scenarios: Left: Static (constant background, actor in motion), and Right: Dynamic (changing background, actor in motion). The temporal difference emphasizes the variation of boundary pixels between consecutive frames.
  • Figure 4: Qualitative analysis for EoR and DoP: The left side illustrates the effectiveness of the Error Recovery module on multiple samples: it improves action boundary precision and helps suppress background noise. The right side shows how the DoP constraint induces temporal coherency in predictions for a sequence of video frames.
  • Figure 5: Analyzing Stable Mean Teacher: (Left) Static vs. dynamic scenes: Dynamic scenes are more challenging than static scenes; however, the relative boost in performance for dynamic scenes is 27.7% more than for static scenes. $\Delta$ denotes relative change at v-mAP@0.5. (Middle) Annotation percent: Moving from right to left on the x-axis, the gain in performance (f-mAP@0.5) increases, indicating that the approach is more effective in the low-label regime. (Right) Error Recovery architectures: The 3D Error Recovery architecture outperforms the 2D-based architecture.
  • ...and 8 more figures
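
The temporal pixel-difference map described in the Figure 3 caption, and one way a consistency penalty could be built on it, can be sketched as follows. This is a hedged illustration only: the names `temporal_difference` and `dop_loss`, the nested-list mask layout (frames, then rows, then columns), and the mean-squared-error comparison are assumptions and may differ from the paper's actual formulation.

```python
def temporal_difference(masks):
    """Per-pixel difference between consecutive frames.

    masks: list of T frames, each a list of rows of numbers in [0, 1].
    Returns T-1 difference maps with the same spatial size.
    """
    return [
        [[b - a for a, b in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev, nxt)]
        for prev, nxt in zip(masks, masks[1:])
    ]


def dop_loss(pred, target):
    """Mean squared error between temporal difference maps (assumed form)."""
    d_pred = temporal_difference(pred)
    d_tgt = temporal_difference(target)
    squared = [
        (p - t) ** 2
        for frame_p, frame_t in zip(d_pred, d_tgt)
        for row_p, row_t in zip(frame_p, frame_t)
        for p, t in zip(row_p, row_t)
    ]
    return sum(squared) / len(squared)  # requires at least two frames


# Toy 1x3 masks: the actor moves one pixel to the right per frame.
target = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
# Same clip, but the middle frame drops the actor (a temporal flicker).
flicker = [[[1, 0, 0]], [[0, 0, 0]], [[0, 0, 1]]]

consistent_penalty = dop_loss(target, target)
flicker_penalty = dop_loss(flicker, target)
```

In this toy case the difference maps are nonzero only where the mask changes between frames, and the flickering prediction receives a larger penalty than the temporally consistent one.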