Action Segmentation Using 2D Skeleton Heatmaps and Multi-Modality Fusion

Syed Waleed Hyder; Muhammad Usama; Anas Zafar; Muhammad Naufil; Fawad Javed Fateh; Andrey Konin; M. Zeeshan Zia; Quoc-Huy Tran

TL;DR

This work tackles fine-grained action segmentation by replacing the usual 3D skeleton coordinates with 2D skeleton heatmaps and processing them with Temporal Convolutional Networks (TCNs) to capture spatiotemporal patterns. It derives joint and limb heatmaps from 2D skeletons, converts them into image-like inputs suitable for CNNs, and uses MS-TCN++ for framewise segmentation. A multi-stage fusion pathway additionally combines RGB with the 2D heatmaps. The key contributions are (1) showing that 2D heatmap inputs give competitive performance and better robustness to missing keypoints, (2) establishing an effective 2D skeleton+RGB fusion framework that operates at multiple stages, and (3) providing experiments on UW-IOM, TUM-Kitchen, and Desktop Assembly in which the approach is comparable or superior to 3D-skeleton-based methods. Because the approach needs no 3D information and fuses directly with RGB data, it is well suited to real-world, multi-modal action understanding tasks.

Abstract

This paper presents a 2D skeleton-based action segmentation method with applications in fine-grained human activity recognition. In contrast with state-of-the-art methods, which directly take sequences of 3D skeleton coordinates as inputs and apply Graph Convolutional Networks (GCNs) for spatiotemporal feature learning, our main idea is to use sequences of 2D skeleton heatmaps as inputs and employ Temporal Convolutional Networks (TCNs) to extract spatiotemporal features. Despite lacking 3D information, our approach yields comparable or superior performance and better robustness against missing keypoints than previous methods on action segmentation datasets. Moreover, we improve performance further by using both 2D skeleton heatmaps and RGB videos as inputs. To the best of our knowledge, this is the first work to utilize 2D skeleton heatmap inputs and the first work to explore 2D skeleton+RGB fusion for action segmentation.
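As an illustration of what a 2D skeleton heatmap input might look like, the following is a minimal NumPy sketch. It assumes each joint is rendered as a Gaussian centered on its 2D location and scaled by the pose estimator's confidence, and each limb as a Gaussian of the distance to the segment joining two joints. The Gaussian form, the `sigma` parameter, and the function names `joint_heatmaps` and `limb_heatmap` are assumptions made for this sketch; they are not taken from the text above.

```python
import numpy as np


def joint_heatmaps(keypoints, scores, height, width, sigma=1.0):
    """Render one Gaussian heatmap per joint (hypothetical helper).

    keypoints: (K, 2) array of (x, y) pixel coordinates.
    scores:    (K,) array of confidences; a score <= 0 marks a missing joint.
    Returns an array of shape (K, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, ((x, y), c) in enumerate(zip(keypoints, scores)):
        if c <= 0:
            continue  # a missing keypoint simply leaves its channel empty
        d2 = (xs - x) ** 2 + (ys - y) ** 2
        maps[k] = c * np.exp(-d2 / (2.0 * sigma ** 2))
    return maps


def limb_heatmap(a, b, score, height, width, sigma=1.0):
    """Render a limb as a Gaussian of the distance to segment a-b (assumed form)."""
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float32)  # (H, W, 2)
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        t = np.zeros((height, width), dtype=np.float32)  # degenerate limb
    else:
        # Projection of each pixel onto the segment, clamped to its endpoints.
        t = np.clip(((pixels - a) @ ab) / denom, 0.0, 1.0)
    closest = a + t[..., None] * ab
    d2 = ((pixels - closest) ** 2).sum(axis=-1)
    return score * np.exp(-d2 / (2.0 * sigma ** 2))
```

Stacking such maps over time gives an image-like sequence that a convolutional network can consume. In this sketch a missing joint only zeroes one channel rather than corrupting a coordinate vector, which is one plausible intuition for the robustness to missing keypoints reported above.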

Paper Structure (15 sections, 5 equations, 5 figures, 6 tables)

Introduction
Related Work
Our Approach
2D Skeleton Heatmap
2D Skeleton-Based Action Segmentation
2D Skeleton+RGB-Based Action Segmentation
Experiments
Impacts of Different Heatmaps
Impacts of Different Features
Robustness against Missing Keypoints
Comparisons on UW-IOM
Comparisons on TUM-Kitchen
Comparisons on Desktop Assembly
Discussions
Conclusion

Figures (5)

Figure 1: Prior methods either take sequences of RGB frames (a) or sequences of 3D skeletons (b) as inputs. We propose a new approach which relies on sequences of 2D skeleton heatmaps (c). We further explore 2D skeleton+RGB fusion (d) for action segmentation, leading to performance gains.
Figure 2: Examples of 2D skeleton heatmaps.
Figure 3: (a) 2D skeleton-based action segmentation. We convert 2D skeletons into image-like heatmaps, which are passed to an RGB-based network for action segmentation, i.e., MS-TCN++ [li2020ms]. (b) 2D skeleton+RGB-based action segmentation. During training, we propose fusion modules at various stages of MS-TCN++ [li2020ms] for deep supervision [lee2015deeply, li2018deep]. At testing, the segmentation predicted by the last refinement stage is considered as our output. (c) 2D skeleton+RGB fusion module.
Figure 4: Examples of missing keypoints.
Figure 5: Qualitative comparisons on Desktop Assembly (sequence 2020-04-02-150120).
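The training and testing behavior described in the Figure 3 caption (supervision at several stages during training, with only the last refinement stage used at testing) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes each stage emits framewise class logits of shape (T, C), uses plain cross-entropy as the per-stage loss, and sums the stage losses with equal weight. The function names are hypothetical, and any additional loss terms are omitted.

```python
import numpy as np


def framewise_cross_entropy(logits, labels):
    """Mean cross-entropy over frames. logits: (T, C), labels: (T,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def deep_supervision_loss(stage_logits, labels):
    """Sum the loss over every supervised stage (equal weights assumed)."""
    return sum(framewise_cross_entropy(l, labels) for l in stage_logits)


def predict(stage_logits):
    """At test time, keep only the last stage's framewise prediction."""
    return stage_logits[-1].argmax(axis=1)
```

Supervising every stage gives earlier stages a direct training signal, while inference stays as cheap as reading off the final stage.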
