AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation

Md Mushfiqur Azam, John Quarles, Kevin Desai

Abstract

Egocentric 3D human pose estimation remains challenging due to severe perspective distortion, limited body visibility, and complex camera motion inherent in first-person viewpoints. Existing methods typically rely on single-frame analysis or limited temporal fusion and therefore fail to fully exploit the rich motion context available in egocentric videos. We introduce AG-EgoPose, a novel dual-stream framework that integrates short- and long-range motion context with fine-grained spatial cues for robust pose estimation from fisheye camera input. Our framework features two parallel streams: a spatial stream uses a weight-sharing ResNet-18 encoder–decoder to generate 2D joint heatmaps and corresponding joint-specific spatial feature tokens, while a temporal stream uses a ResNet-50 backbone to extract visual features that are then processed by an action recognition backbone to capture motion dynamics. These complementary representations are fused and refined in a transformer decoder with learnable joint tokens, allowing joint-level integration of spatial and temporal evidence while maintaining anatomical constraints. Experiments on real-world datasets demonstrate that AG-EgoPose achieves state-of-the-art performance in both quantitative and qualitative evaluations. Code is available at: https://github.com/Mushfiq5647/AG-EgoPose.
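The fusion step described above can be illustrated with a minimal PyTorch-style sketch: per-joint spatial tokens and per-joint temporal tokens (assumed to be already extracted by the two streams) are concatenated into a joint-level memory, learnable joint queries attend to that memory through a standard transformer decoder, and a small pose head regresses 3D coordinates. The joint count, feature widths, and layer counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class JointQueryDecoder(nn.Module):
    """Sketch of joint-level fusion with learnable joint queries.

    Per-joint spatial and temporal tokens are concatenated into a
    joint-level memory; learnable joint queries attend to it through a
    transformer decoder, and a pose head regresses (x, y, z) per joint.
    """

    def __init__(self, num_joints=15, spat_dim=256, temp_dim=256, d_model=256):
        super().__init__()
        self.joint_queries = nn.Parameter(torch.randn(num_joints, d_model))
        self.memory_proj = nn.Linear(spat_dim + temp_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.pose_head = nn.Linear(d_model, 3)  # 3D coordinates per joint

    def forward(self, spatial_tokens, temporal_tokens):
        # spatial_tokens:  (B, J, spat_dim)  joint-specific spatial features
        # temporal_tokens: (B, J, temp_dim)  joint-aligned motion features
        memory = self.memory_proj(torch.cat([spatial_tokens, temporal_tokens], dim=-1))
        queries = self.joint_queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        refined = self.decoder(tgt=queries, memory=memory)  # (B, J, d_model)
        return self.pose_head(refined)                       # (B, J, 3)


# Usage: batch of 2 clips, 15 joints, 256-d spatial and temporal tokens.
model = JointQueryDecoder()
pose3d = model(torch.randn(2, 15, 256), torch.randn(2, 15, 256))
print(pose3d.shape)  # torch.Size([2, 15, 3])
```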

Paper Structure

This paper contains 19 sections, 14 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of our egocentric 3D pose estimation model. Egocentric fisheye video frames are processed by two parallel streams: (1) Spatial Encoder: generates 2D joint heatmaps using a weight-sharing ResNet-18 encoder–decoder with unified skip connections and encodes spatial joint features. (2) Motion Encoder: an ActionFormer-based temporal encoder operates on visual features extracted using ResNet-50 to capture short- and long-term motion dynamics. Spatial and temporal features are concatenated per joint to form a joint-level memory. A transformer decoder with learnable joint queries attends to this memory, enabling joint-specific integration of spatial and temporal evidence. The decoder output is then passed through a pose head to regress 3D joint coordinates.
  • Figure 2: 2D heatmap prediction network with ResNet-18 encoder and FPN decoder using unified skip connections (a minimal code sketch of this design follows the figure list).
  • Figure 3: Qualitative comparison between our method and state-of-the-art egocentric 3D pose estimation methods. From left to right, we show the input image followed by the results of Mo²Cap², xR-EgoPose, EgoPW, SceneEgo, and our method. The top two rows are from the SceneEgo dataset, where ground-truth poses are shown in red. The bottom two rows are from the EgoPW dataset (without ground-truth poses).
  • Figure 4: Joint-wise error analysis for EgoPW and SceneEgo datasets.
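As a complement to the Figure 2 caption, the following PyTorch-style sketch illustrates one plausible reading of the 2D heatmap network: a ResNet-18 encoder whose stage outputs are projected to a common channel width and fused top-down in an FPN-style decoder, standing in for the paper's unified skip connections. The joint count, channel widths, and exact skip design are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HeatmapNet(nn.Module):
    """Sketch of a ResNet-18 encoder with an FPN-style decoder.

    Every encoder stage is projected to a shared width by a lateral
    1x1 conv (an approximation of unified skip connections), then fused
    top-down and mapped to per-joint heatmap logits.
    """

    def __init__(self, num_joints=15, fpn_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        # Lateral 1x1 convs unify the stage outputs (64/128/256/512 ch) to fpn_dim.
        self.lateral = nn.ModuleList([nn.Conv2d(c, fpn_dim, 1) for c in (64, 128, 256, 512)])
        self.smooth = nn.ModuleList([nn.Conv2d(fpn_dim, fpn_dim, 3, padding=1) for _ in range(3)])
        self.head = nn.Conv2d(fpn_dim, num_joints, 1)  # per-joint heatmap logits

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Top-down pathway: start from the deepest stage and fuse upward.
        p = self.lateral[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            p = nn.functional.interpolate(p, size=feats[i].shape[-2:], mode="nearest")
            p = self.smooth[i](self.lateral[i](feats[i]) + p)
        return self.head(p)  # (B, num_joints, H/4, W/4)


# Usage: a 256x256 fisheye crop yields 64x64 heatmaps for 15 joints.
heatmaps = HeatmapNet()(torch.randn(1, 3, 256, 256))
print(heatmaps.shape)  # torch.Size([1, 15, 64, 64])
```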