Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition

Nicholas Babey, Tiffany Gu, Yiheng Li, Cristian Meo, Kevin Zhu

TL;DR

This work addresses the brittleness of RGB-only action recognition in cluttered or occluded scenes by grounding actions in 3D space. It introduces a two-stream fusion architecture that combines V-JEPA 2's contextual world dynamics with CoMotion's explicit 3D human poses through bidirectional cross-attention, followed by self-attention and a final MLP classifier. The approach delivers more accurate, occlusion-robust action recognition on InHARD and the occlusion-focused UCF-19-Y-OCC, outperforming V-JEPA 2, CoMotion, and other fusion baselines. The findings underscore the value of spatial grounding and pose-aware context for embodied AI, while acknowledging that the results depend on the underlying feature extractors and on the limited range of scenarios tested.
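The fusion pipeline described above (bidirectional cross-attention, then self-attention, then an MLP classifier) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. The token counts, feature width, random weights, residual connections, mean pooling, and the helper names `attend` and `classify` are all assumptions made for illustration, and the learned query/key/value projections and multi-head structure of a real attention layer are omitted. The random matrices stand in for V-JEPA 2 and CoMotion features, which are assumed to be already projected to a shared width.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = math.sqrt(len(queries[0]))
    transposed = [list(col) for col in zip(*context)]
    scores = matmul(queries, transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, context)

def residual(x, update):
    return [[a + b for a, b in zip(r, u)] for r, u in zip(x, update)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def classify(video_tokens, pose_tokens, w_hidden, w_out):
    # Bidirectional cross-attention: each stream queries the other.
    video_ctx = residual(video_tokens, attend(video_tokens, pose_tokens))
    pose_ctx = residual(pose_tokens, attend(pose_tokens, video_tokens))
    # Self-attention over the concatenated token sequence.
    fused = video_ctx + pose_ctx
    refined = residual(fused, attend(fused, fused))
    # Mean-pool the tokens (an assumed pooling choice), then a two-layer MLP.
    pooled = [[sum(col) / len(refined) for col in zip(*refined)]]
    hidden = [[max(0.0, h) for h in matmul(pooled, w_hidden)[0]]]
    return softmax(matmul(hidden, w_out)[0])

rng = random.Random(0)
DIM, HIDDEN, NUM_CLASSES = 8, 16, 5   # illustrative sizes only
video = random_matrix(6, DIM, rng)    # stand-in for visual feature tokens
pose = random_matrix(4, DIM, rng)     # stand-in for 3D pose feature tokens
probs = classify(video, pose,
                 random_matrix(DIM, HIDDEN, rng),
                 random_matrix(HIDDEN, NUM_CLASSES, rng))
print(len(probs), round(sum(probs), 6))
```

In this sketch each stream keeps its own tokens and adds what it gathers from the other stream, so pose evidence can still inform the visual tokens when the RGB view is occluded. A trained model would replace the random weights with learned parameters.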

Abstract

For embodied agents to effectively understand and interact within the world around them, they require a nuanced comprehension of human actions grounded in physical space. Current action recognition models, often relying on RGB video, learn superficial correlations between patterns and action labels, so they struggle to capture underlying physical interaction dynamics and human poses in complex scenes. We propose a model architecture that grounds action recognition in physical space by fusing two powerful, complementary representations: V-JEPA 2's contextual, predictive world dynamics and CoMotion's explicit, occlusion-tolerant human pose data. Our model is validated on the InHARD and UCF-19-Y-OCC benchmarks for general action recognition and high-occlusion action recognition, respectively. Our model outperforms three other baselines, especially in complex, occluded scenes. Our findings emphasize the need for action recognition to be supported by spatial understanding rather than statistical pattern recognition alone.

Paper Structure

This paper contains 11 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Fusion model architecture showing the pipeline of visual and skeletal feature sequences undergoing cross-attention and refinement to classify actions.