SHARP: Segmentation of Hands and Arms by Range using Pseudo-Depth for Enhanced Egocentric 3D Hand Pose Estimation and Action Recognition

Wiktor Mucha, Michael Wray, Martin Kampel

TL;DR

This work tackles egocentric 3D hand pose estimation and action recognition from RGB data alone by introducing SHARP, a pseudo-depth segmentation module that uses a monocular depth estimator (DPT-Hybrid) to mask out irrelevant background beyond a fixed arm-range threshold. SHARP-segmented frames are fed into an extended 3D hand pose network (EffHandEgoNet3D), and the resulting 3D hand poses are combined with 2D object detections in a transformer-based action recognizer. On the H2O dataset, the approach achieves a mean per-joint position error (MPJPE) of $28.66$ mm, competitive with existing methods, and an action recognition accuracy of $91.73\%$, outperforming prior methods. Ablation studies show that performance improves over unsegmented RGB, and that using oracle depth in place of pseudo-depth further reduces MPJPE to $25.09$ mm, demonstrating the potential of depth-informed segmentation. The results suggest that pseudo-depth, when properly integrated, can narrow the gap between RGB-only methods and RGB-D systems for egocentric hand pose and action understanding, enabling accurate recognition without extra sensors. The work also reports favorable inference speed and parameter efficiency compared to recent state-of-the-art methods, underscoring its practical value for AR/VR and assistive technologies.

Abstract

Hand pose represents key information for action recognition in the egocentric perspective, where the user is interacting with objects. We propose to improve egocentric 3D hand pose estimation based on RGB frames only by using pseudo-depth images. Incorporating state-of-the-art single RGB image depth estimation techniques, we generate pseudo-depth representations of the frames and use distance knowledge to segment irrelevant parts of the scene. The resulting depth maps are then used as segmentation masks for the RGB frames. Experimental results on the H2O dataset confirm the high accuracy of the pose estimated with our method in an action recognition task. The 3D hand pose, together with information from object detection, is processed by a transformer-based action recognition network, resulting in an accuracy of 91.73%, outperforming all state-of-the-art methods. The 3D hand pose estimates are competitive with existing methods, with a mean pose error of 28.66 mm. This method opens up new possibilities for employing distance information in egocentric 3D hand pose estimation without relying on depth sensors.
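
The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `sharp_mask`, the default threshold of 0.8 m, and the assumption that the pseudo-depth map is already expressed in metres are all hypothetical. A monocular estimator such as DPT-Hybrid typically outputs relative depth, so some rescaling would be needed before a distance threshold is meaningful.

```python
import numpy as np

def sharp_mask(rgb: np.ndarray, depth_m: np.ndarray, arm_range_m: float = 0.8) -> np.ndarray:
    """Zero out RGB pixels whose pseudo-depth lies beyond the arm-range threshold.

    rgb:         (H, W, 3) image.
    depth_m:     (H, W) pseudo-depth map, ASSUMED to be in metres.
    arm_range_m: hypothetical threshold value; the source only states it is fixed.
    """
    if rgb.shape[:2] != depth_m.shape:
        raise ValueError("rgb and depth_m must share height and width")
    keep = depth_m <= arm_range_m               # True where the pixel is within reach
    # Broadcast the (H, W) mask over the three colour channels.
    return rgb * keep[..., None].astype(rgb.dtype)

# Toy 2x2 frame: the top row is within reach, the bottom row is background.
rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
depth = np.array([[0.3, 0.5],
                  [1.2, 2.0]])
masked = sharp_mask(rgb, depth)
```

Pixels in the top row keep their colour while the bottom row is set to black, so a downstream pose network would see only the region within arm's reach.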

Paper Structure (25 sections, 2 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Overview of our method. In the sequence of input frames $f_1, f_2, f_3, \dots, f_n$ representing the action, SHARP improves the estimation of the 3D hand pose $Ph^{3D}_{L,R,n}$. The bounding boxes of the manipulated objects $Po^{2D}_{n}$ and their labels $Po_{l}$ are retrieved using YOLOv7 [wang2022yolov7]. Pose information is embedded in a vector describing each frame. The sequence of vectors is processed by the transformer-based network to predict the action.
  • Figure 2: Overview of the proposed egocentric 3D hand pose estimation method. First, the RGB image is processed with the SHARP module. Within SHARP, the pseudo-depth image is generated using DPT-Hybrid. This distance representation is used to remove irrelevant scene information using a fixed threshold $t$ corresponding to the human arm range. Second, the SHARP output is passed through a 3D hand pose estimation network.
  • Figure 3: Our action recognition procedure. From the sequence of frames $f_1, f_2, f_3, \dots, f_n$, the hand pose $Ph^{3D}_{L,R}$ is estimated with SHARP and the EffHandEgoNet3D model, and the object pose $Po^{2D}$, $Po_{l}$ is extracted with YOLOv7 [wang2022yolov7]. Each sequence frame $f_n$ is linearised, and positional embedding and classification tokens are added. Next, this sequence is passed to a transformer encoder [dosovitskiy2020image], repeated twice, which embeds the temporal information. Finally, the MLP predicts one of the 36 action labels.
  • Figure 4: Qualitative results of our method in 2D and 3D space. Green skeletons represent the ground-truth hand pose, red skeletons the estimations without SHARP, and blue skeletons the estimations with SHARP. Images are annotated with the predicted action label for the represented sequences. The two examples on the left show that SHARP improves 3D pose estimation. On the right, the 3D error increases as SHARP partially loses the right hand.
  • Figure 5: Inference time for 3D hand pose estimation per single frame and action recognition accuracy per single action of state-of-the-art methods on the H2O dataset. Each method is visualised as a circle whose size represents the number of trainable parameters. SHARP inference is $\approx 2.5\times$ faster than H2OTR [cho2023transformer], with better action recognition accuracy.
  • ...and 2 more figures
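
The per-frame vector mentioned in the captions of Figures 1 and 3 can be illustrated with a short sketch. The layout below is an assumption made for illustration only: 21 joints per hand, a bounding box given as four corner coordinates `(x1, y1, x2, y2)`, and the object label stored as a single number. The helper names `frame_vector` and `sequence_matrix` are hypothetical, and the captions do not specify the exact encoding the authors use.

```python
import numpy as np

N_JOINTS = 21  # ASSUMPTION: 21 keypoints per hand; not stated in the captions

def frame_vector(left_hand, right_hand, obj_box, obj_label):
    """Flatten one frame's hand poses, object box and label into a 1-D vector.

    left_hand, right_hand: (N_JOINTS, 3) arrays of 3D joint positions.
    obj_box:               4 values (x1, y1, x2, y2), an assumed box encoding.
    obj_label:             integer class id of the manipulated object.
    """
    left = np.asarray(left_hand, dtype=np.float32).reshape(-1)
    right = np.asarray(right_hand, dtype=np.float32).reshape(-1)
    box = np.asarray(obj_box, dtype=np.float32).reshape(-1)
    if left.size != N_JOINTS * 3 or right.size != N_JOINTS * 3:
        raise ValueError("each hand must provide N_JOINTS 3D points")
    if box.size != 4:
        raise ValueError("obj_box must hold four coordinates")
    label = np.array([obj_label], dtype=np.float32)
    return np.concatenate([left, right, box, label])

def sequence_matrix(frames):
    """Stack per-frame vectors into an (n_frames, feature_dim) matrix."""
    return np.stack([frame_vector(*f) for f in frames])

# Toy sequence of three frames with zeroed poses and a dummy object.
hand = np.zeros((N_JOINTS, 3))
frames = [(hand, hand, (10, 20, 110, 220), 4) for _ in range(3)]
seq = sequence_matrix(frames)
```

Under these assumptions each frame becomes a 131-dimensional vector (2 × 21 × 3 hand coordinates, 4 box values, 1 label), and the stacked sequence is what a transformer encoder would receive once positional embeddings and a classification token have been added.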