Pose-Aware Weakly-Supervised Action Segmentation

Seth Z. Zhao, Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, Behzad Dariush

TL;DR

The paper tackles weakly-supervised action segmentation in long instructional videos that lack frame-level labels. It introduces a training-time pose encoder and a pose-guided contrastive loss that distill pose knowledge into an RGB encoder, so that inference at test time needs RGB input only. The method reports state-of-the-art results in both online and offline settings on the ATA, IKEA, and Desktop Assembly datasets, and remains robust to different pose extractors and backbone architectures. By reducing labeling costs while improving temporal boundary detection, the approach is suited to real-time instructional video understanding and to deployment in resource-constrained environments. The training objective combines a pose-based contrastive loss with a segmentation loss, i.e., $\mathcal{L}_{Final} = \mathcal{L}_{con} + \mathcal{L}_{segment}$, where $\mathcal{L}_{con} = \mathcal{L}_{I2P} + \mathcal{L}_{P2I}$ explicitly leverages pose similarities through pose-distance-based negative mining with distance $d_{t,j}=|\overline{p}_t-\overline{p}_j|$ and threshold $\delta$.
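The structure of this objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary does not give the exact form of $\mathcal{L}_{I2P}$ and $\mathcal{L}_{P2I}$, so the sketch assumes an InfoNCE-style loss in each direction, a Euclidean norm for $d_{t,j}$, and the rule that frame $j$ is a negative for anchor $t$ when $d_{t,j} > \delta$. All function names, the `temperature` parameter, and the array shapes are hypothetical.

```python
import numpy as np

def pose_distance(pose):
    """Pairwise distances d[t, j] = ||p_t - p_j|| (Euclidean norm is an assumption)."""
    diff = pose[:, None, :] - pose[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def negative_mask(pose, delta):
    """Assumed mining rule: frame j is a negative for anchor t when d[t, j] > delta."""
    return pose_distance(pose) > delta

def directional_loss(anchor, target, neg_mask, temperature=0.1):
    """InfoNCE-style loss from anchor embeddings to target embeddings (assumed form).

    The positive for frame t is the target embedding of the same frame;
    the negatives are the target embeddings selected by neg_mask[t].
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    b = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = a @ b.T / temperature              # shape (T, T)
    pos = np.diag(logits)                       # same-frame similarity
    # Keep only the positive and the mined negatives in the denominator.
    keep = neg_mask | np.eye(len(a), dtype=bool)
    masked = np.where(keep, logits, -np.inf)
    # Numerically stable log-sum-exp over the kept entries.
    m = masked.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(masked - m).sum(axis=1))
    return float(np.mean(log_denom - pos))

def contrastive_loss(img_emb, pose_emb, pose, delta, temperature=0.1):
    """L_con = L_I2P + L_P2I, sharing one pose-distance negative mask."""
    mask = negative_mask(pose, delta)
    l_i2p = directional_loss(img_emb, pose_emb, mask, temperature)
    l_p2i = directional_loss(pose_emb, img_emb, mask, temperature)
    return l_i2p + l_p2i

def final_loss(l_con, l_segment):
    """L_Final = L_con + L_segment (unweighted sum, as stated in the summary)."""
    return l_con + l_segment

# Toy usage with random data: 6 frames, 4-dim pose vectors, 8-dim embeddings.
rng = np.random.default_rng(0)
pose = rng.normal(size=(6, 4))
img_emb = rng.normal(size=(6, 8))
pose_emb = rng.normal(size=(6, 8))
l_con = contrastive_loss(img_emb, pose_emb, pose, delta=1.0)
```

In this sketch, raising `delta` removes negatives from the denominator, so the contrastive term shrinks toward zero once no frame pair exceeds the threshold. The pose branch appears only in the loss, which is consistent with the RGB-only inference described above.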

Abstract

Understanding human behavior is an important problem in the pursuit of visual intelligence. A challenge in this endeavor is the extensive and costly effort required to accurately label action segments. To address this issue, we consider learning methods that demand minimal supervision for segmentation of human actions in long instructional videos. Specifically, we introduce a weakly-supervised framework that uniquely incorporates pose knowledge during training while omitting its use during inference, thereby distilling pose knowledge pertinent to each action component. We propose a pose-inspired contrastive loss as a part of the whole weakly-supervised framework which is trained to distinguish action boundaries more effectively. Our approach, validated through extensive experiments on representative datasets, outperforms previous state-of-the-art (SOTA) in segmenting long instructional videos under both online and offline settings. Additionally, we demonstrate the framework's adaptability to various segmentation backbones and pose extractors across different datasets.

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Framework overview: Pose information is used exclusively during training. During inference, only image input is considered, omitting the pose branch.
  • Figure 2: Two methods to mine positive and negative frames for a given anchor in our contrastive learning framework.
  • Figure 3: Sensitivity analysis of $L_{con\text{-pose}}$ for DP [Ghoddoosian_2023_ICCV] online segmentation and TASL [tasl] offline segmentation on both the ATA [Ghoddoosian_2023_ICCV] and Desktop Assembly [desktop] datasets. The x-axis shows the threshold value and the y-axis shows accuracy (acc) and IoU.
  • Figure 4: Visualization of zero-shot pose extraction results on both the ATA dataset and the Desktop Assembly dataset. The first and second rows show RTMPose Body2D (sparse keypoints) and RTMPose WholeBody2D (dense keypoints) results, respectively. Compared to the Body2D keypoints, the WholeBody2D keypoints include 116 additional keypoints on the hands and face.
  • Figure 5: Qualitative results of our pose-based contrastive learning in online (left) and offline (right) segmentation. Understanding fine-grained human pose results in more accurate detection of action boundaries at test time.