Action-Agnostic Point-Level Supervision for Temporal Action Detection

Shuhei M. Yoshida, Takashi Shibata, Makoto Terao, Takayuki Okatani, Masashi Sugiyama

TL;DR

The paper addresses the high annotation cost of temporal action detection by introducing Action-Agnostic Point-Level (AAPL) supervision, in which a small, automatically selected set of frames is labeled by humans with action categories. It presents an end-to-end detection framework that combines snippet-based scoring, a two-headed prediction module, and a trio of losses, including a prototype-anchored contrastive loss and ground-truth-anchored pseudo-labeling to exploit unlabeled data. Empirical results across five diverse datasets show that AAPL is competitive with or surpasses both video-level and point-level supervision at comparable annotation budgets, and often requires substantially less labeling effort. The findings support AAPL as a practical, scalable weak supervision paradigm for temporal action localization and offer actionable guidance on frame sampling and loss design to maximize cost-efficiency and accuracy.

Abstract

We propose action-agnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, AAPL supervision selects the frames to annotate without human intervention. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on a variety of datasets (THUMOS '14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between annotation cost and detection performance.
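
The frame-selection step described in the abstract can be illustrated with a minimal sketch. The abstract states only that frames are sampled in an unsupervised manner, so the two strategies below (fixed-interval and uniform random sampling) are assumptions chosen for illustration, and the function names `sample_frames_regular` and `sample_frames_random` are hypothetical rather than taken from the paper.

```python
import random


def sample_frames_regular(num_frames: int, interval: int, offset: int = 0) -> list[int]:
    """Select every `interval`-th frame index, starting at `offset`.

    Assumed strategy for illustration; no action information is used.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(offset, num_frames, interval))


def sample_frames_random(num_frames: int, num_samples: int, seed: int = 0) -> list[int]:
    """Select `num_samples` distinct frame indices uniformly at random.

    Also an assumed strategy; the result is sorted into temporal order.
    """
    rng = random.Random(seed)
    k = min(num_samples, num_frames)
    return sorted(rng.sample(range(num_frames), k))


# Frames chosen this way would then be shown to annotators, who assign
# an action category (or "Background") to each selected frame.
selected = sample_frames_regular(num_frames=100, interval=30)
print(selected)  # [0, 30, 60, 90]
```

Neither function looks at the video content or at where actions occur, which is the sense in which the selection is action-agnostic: the annotator labels whatever frames are presented instead of searching the video for action instances.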

Paper Structure

This paper contains 35 sections, 10 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of ground-truth (full supervision), point-level supervision, and AAPL supervision. The red boxes and lines represent the frames labeled as "Volleyball Spiking", and the black lines represent those labeled as "Background". The images are from a video in THUMOS '14 [Jiang_THUMOS14_2014].
  • Figure 2: Illustration of the model and the loss functions.
  • Figure 3: Trade-off between detection performance and annotation time. The blue squares represent AAPL-supervised training with $L_{\mathrm{pt}}$ only, the red circles represent AAPL-supervised training with our full objective, the gray diamonds represent video-level methods [Paul_W-TALC_ECCV_2018, Min_A2CL-PT_ECCV_2020, Qu_ACM-Net_arXiv_2021, Huang_RSKP_CVPR_2022, Chen_DELU_ECCV_2022, Wang_AHLM_ICCV_2023], and the yellow triangles represent point-level methods [Ma_SF-Net_ECCV_2020, Ju_Point-Level_ICCV_2021, Lee_LACP_ICCV_2021, Li_PCL_ESWA_2023].