
Point-Supervised Skeleton-Based Human Action Segmentation

Hongsong Wang, Yiqin Shen, Pengbo Yan, Jie Gui

TL;DR

A point-supervised framework for skeleton-based action segmentation in which only a single frame per action segment is labeled. Multimodal pseudo-label integration is proposed to improve the reliability of the pseudo-labels and guide model training.

Abstract

Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
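
To make the inputs mentioned in the abstract more concrete, the sketch below shows one common way to derive the bone and motion modalities from joint coordinates, followed by a generic nearest-prototype pseudo-labeling step based on point labels. This is only an illustration under stated assumptions, not the authors' implementation: the 5-joint `PARENTS` layout, the function names, and the use of cosine similarity to the labeled frames are all hypothetical choices, and the paper's prototype similarity method may differ.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton: PARENTS[j] is the parent of joint j.
# The root (joint 0) points to itself, so its bone vector is zero.
PARENTS = np.array([0, 0, 1, 1, 3])


def skeleton_modalities(joints, parents=PARENTS):
    """Derive bone and motion streams from joint coordinates.

    joints: array of shape (T, V, C) = (frames, joints, coordinates).
    Returns (joint, bone, motion), each of shape (T, V, C).
    """
    joints = np.asarray(joints, dtype=np.float64)
    # Bone: vector from each joint's parent to the joint itself.
    bone = joints - joints[:, parents, :]
    # Motion: frame-to-frame displacement; the first frame has no predecessor.
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]
    return joints, bone, motion


def prototype_pseudo_labels(features, point_frames, point_labels):
    """Label every frame with the class of its most similar labeled frame.

    features: array of shape (T, D), one feature vector per frame.
    point_frames: indices of the single labeled frame in each segment.
    point_labels: class id of each labeled frame (same length as point_frames).
    Assumption: cosine similarity to the labeled frames' features, which
    stand in for prototypes here.
    """
    features = np.asarray(features, dtype=np.float64)
    point_labels = np.asarray(point_labels)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # avoid division by zero
    prototypes = unit[np.asarray(point_frames)]  # shape (K, D)
    similarity = unit @ prototypes.T             # shape (T, K)
    return point_labels[similarity.argmax(axis=1)]
```

In a point-supervised setting, `point_frames` would hold one index per action segment, so the labeling cost grows with the number of segments rather than the number of frames.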

Paper Structure (12 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison between point-supervised annotation and fully-supervised annotation for skeleton-based action segmentation.
  • Figure 2: Overall framework of point-supervised skeleton-based temporal action segmentation via multimodal pseudo-label generation and integration.
  • Figure 3: Comparison of the results of different pseudo-label generation methods on the PKU-MMD (X-Sub) dataset.
  • Figure 4: Comparison of the action segmentation results of different point-supervised methods on the PKU-MMD (X-Sub) dataset.