Action Segmentation Using 2D Skeleton Heatmaps and Multi-Modality Fusion
Syed Waleed Hyder, Muhammad Usama, Anas Zafar, Muhammad Naufil, Fawad Javed Fateh, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran
TL;DR
This work tackles fine-grained action segmentation by substituting 2D skeleton heatmaps for traditional 3D coordinates and processing them with Temporal Convolutional Networks to capture spatiotemporal patterns. It introduces joint and limb heatmaps derived from 2D skeletons, converts them into CNN-friendly inputs, and uses MS-TCN++ for framewise segmentation, complemented by a multi-stage fusion pathway that combines RGB with 2D heatmaps. The key contributions are (1) demonstrating competitive performance and robustness to missing keypoints with 2D heatmap inputs, (2) establishing an effective 2D skeleton+RGB fusion framework across multiple stages, and (3) providing comprehensive experiments on UW-IOM, TUM-Kitchen, and Desktop Assembly showing results comparable or superior to those of 3D-skeleton-based methods. The approach offers practical benefits by avoiding depth estimation requirements and enabling seamless fusion with RGB data, suggesting applicability to real-world, multi-modal action understanding tasks.
Abstract
This paper presents a 2D skeleton-based action segmentation method with applications in fine-grained human activity recognition. In contrast to state-of-the-art methods, which take sequences of 3D skeleton coordinates directly as inputs and apply Graph Convolutional Networks (GCNs) for spatiotemporal feature learning, our main idea is to use sequences of 2D skeleton heatmaps as inputs and employ Temporal Convolutional Networks (TCNs) to extract spatiotemporal features. Despite lacking 3D information, our approach yields comparable or superior performance and better robustness against missing keypoints than previous methods on action segmentation datasets. Moreover, we further improve performance by using both 2D skeleton heatmaps and RGB videos as inputs. To the best of our knowledge, this is the first work to utilize 2D skeleton heatmap inputs and the first work to explore 2D skeleton+RGB fusion for action segmentation.
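To make the idea of joint and limb heatmaps concrete, the following is a minimal NumPy sketch of how 2D keypoints could be rendered into per-joint and per-limb heatmap channels. The text above does not specify the rendering, so the Gaussian form, the confidence weighting, the `sigma` value, and the function names `joint_heatmaps` and `limb_heatmap` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def joint_heatmaps(keypoints, scores, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint (assumed formulation).

    keypoints: (K, 2) array of (x, y) pixel coordinates.
    scores:    (K,) detector confidences; a score <= 0 marks a missing joint.
    Returns an array of shape (K, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, ((x, y), c) in enumerate(zip(keypoints, scores)):
        if c <= 0:
            # Missing keypoint: leave this channel empty.
            continue
        maps[k] = c * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

def limb_heatmap(p_a, p_b, score, height, width, sigma=2.0):
    """Render a heatmap for the limb (line segment) between joints p_a and p_b.

    Each pixel's value decays with its distance to the segment (assumed form).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    a = np.asarray(p_a, dtype=np.float32)
    b = np.asarray(p_b, dtype=np.float32)
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        # Degenerate limb: both endpoints coincide, so fall back to a point.
        t = np.zeros((height, width), dtype=np.float32)
    else:
        # Projection of each pixel onto the segment, clipped to its endpoints.
        t = np.clip(((xs - a[0]) * ab[0] + (ys - a[1]) * ab[1]) / denom, 0.0, 1.0)
    dx = xs - (a[0] + t * ab[0])
    dy = ys - (a[1] + t * ab[1])
    return (score * np.exp(-(dx ** 2 + dy ** 2) / (2 * sigma ** 2))).astype(np.float32)
```

Stacking such channels frame by frame would give an image-like tensor that a CNN can consume. In this sketch a missing keypoint simply yields an empty channel rather than an undefined coordinate, which is one plausible reason heatmap inputs could tolerate missing detections.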
