ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization

Yuchen He, Jianbing Lv, Liqi Cheng, Lingyu Meng, Dazhen Deng, Yingcai Wu

TL;DR

ProTAL presents a novel drag-and-link video programming framework for Temporal Action Localization (TAL) that decomposes complex actions into key events defined via relationships among visual elements. It combines automatic extraction of body parts and objects with an interactive graph-based interface to generate frame-level labels for unlabeled videos, which are then used in a semi-supervised TAL training pipeline. The approach is validated through a practical table tennis scenario and a controlled user study, showing improved usability, efficiency, and competitive performance against fully supervised baselines while dramatically reducing labeling effort. The work provides insights into interactive video programming design, constraint spaces for TAL, and directions for extending data programming to vision tasks.

Abstract

Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, training TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method for creating training labels with a series of human-defined labeling functions. However, applying it to TAL is difficult because complex actions are hard to define over sequences of video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes that represent body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with these labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into the design of video programming frameworks.
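
To make the idea of a key event definition acting as a labeling function more concrete, here is a minimal sketch. It assumes each frame has already been reduced to named 2D keypoints in pixel coordinates. The class names, keypoint names ("shoulder", "wrist", "ball"), and thresholds are hypothetical choices for illustration and are not taken from ProTAL itself.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) in pixels; y grows downward as in images
Frame = Dict[str, Point]      # hypothetical: element name -> position


@dataclass
class DirectionConstraint:
    """dst must lie within an angular range as seen from src (degrees)."""
    src: str
    dst: str
    min_deg: float
    max_deg: float

    def holds(self, frame: Frame) -> bool:
        (x1, y1), (x2, y2) = frame[self.src], frame[self.dst]
        # Flip y so that 90 degrees means "dst is above src" on screen.
        angle = math.degrees(math.atan2(-(y2 - y1), x2 - x1)) % 360
        if self.min_deg <= self.max_deg:
            return self.min_deg <= angle <= self.max_deg
        # Range wraps around 0 degrees, e.g. 350..10.
        return angle >= self.min_deg or angle <= self.max_deg


@dataclass
class DistanceConstraint:
    """Two elements must be at most max_dist pixels apart."""
    a: str
    b: str
    max_dist: float

    def holds(self, frame: Frame) -> bool:
        return math.dist(frame[self.a], frame[self.b]) <= self.max_dist


def label_frames(frames: List[Frame], constraints) -> List[bool]:
    """A frame is labeled positive only if every constraint holds."""
    return [all(c.holds(f) for c in constraints) for f in frames]


# Hypothetical key event: wrist raised above the shoulder, ball near the wrist.
key_event = [
    DirectionConstraint("shoulder", "wrist", 45, 135),
    DistanceConstraint("wrist", "ball", 50),
]
frames = [
    {"shoulder": (100, 200), "wrist": (110, 260), "ball": (300, 100)},
    {"shoulder": (100, 200), "wrist": (120, 120), "ball": (130, 110)},
    {"shoulder": (100, 200), "wrist": (120, 120), "ball": (300, 100)},
]
labels = label_frames(frames, key_event)
print(labels)  # [False, True, False]
```

Only the second frame satisfies both constraints: in the first the wrist is below the shoulder, and in the third the ball is too far from the wrist. Contact and association constraints could be added as further classes with the same `holds` interface.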

Paper Structure

This paper contains 44 sections, 5 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The space of visual elements and constraints in key event definitions. Visual elements include two categories: human-related visual elements, mainly human body parts, usually represented as skeletons; and object-related visual elements, including objects involved in the action. Constraints include direction, relative distance, contact, and association constraint.
  • Figure 2: Framework of ProTAL. The first stage is (A) the automatic extraction of action-relevant visual elements. The second stage is (B) the interactive definition of key events, followed by (C) the generation of key event labels. The third stage is (D) model training with a semi-supervised TAL method based on the generated labels.
  • Figure 3: System screenshot. Users can navigate the video dataset and identify key events in Dataset View (A). They can add key events in Event View (B) and define them through drag-and-link interactions in Defining View (C). The distribution of generated labels and the labeled frames can be reviewed in Dataset View and Frame View (D) to guide the refinement of definitions. Training View (E) shows the progress of TAL model training based on the generated labels.
  • Figure 4: The Dataset View contains: (A) a cell matrix, where each cell represents a video, (B) a video display module, and (C) a timeline module containing two timelines, the top one (C2) showing the label distribution and the bottom one (C1) showing the user's markers.
  • Figure 5: The Defining View contains: a timeline (A) for setting the number of states and time intervals, and a canvas featuring drag-and-link interactions. Users can drag to adjust node positions (B) and direction ranges (C), link nodes to define constraints such as direction (D), distance, and contact (E). This design also facilitates the refinement of key event definitions (F).
  • ...and 4 more figures
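
Since TAL is concerned with start and end timestamps while the generated labels are per frame, one simple way to relate the two is to merge runs of consecutive positive frames into time intervals. The sketch below illustrates that step under stated assumptions: the function name `frames_to_segments` and the constant frame rate `fps` are hypothetical, and this is not a description of how ProTAL itself post-processes its labels.

```python
from itertools import groupby
from typing import List, Tuple


def frames_to_segments(labels: List[bool], fps: float) -> List[Tuple[float, float]]:
    """Merge consecutive positive frames into (start_sec, end_sec) intervals."""
    segments = []
    idx = 0  # index of the first frame in the current run
    for value, run in groupby(labels):
        length = len(list(run))
        if value:
            # The run covers frames idx .. idx + length - 1, so it ends at
            # the start time of frame idx + length.
            segments.append((idx / fps, (idx + length) / fps))
        idx += length
    return segments


segments = frames_to_segments([False, True, True, False, True], fps=2.0)
print(segments)  # [(0.5, 1.5), (2.0, 2.5)]
```

In practice one might also drop very short segments or bridge small gaps before using the intervals as training targets; those choices are left out here.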