PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos

Yihao Wang, Yang Miao, Wenshuai Zhao, Wenyan Yang, Zihan Wang, Joni Pajarinen, Luc Van Gool, Danda Pani Paudel, Juho Kannala, Xi Wang, Arno Solin

Abstract

Articulation perception aims to recover the motion and structure of articulated objects (e.g., drawers and cupboards), and is fundamental to 3D scene understanding in robotics, simulation, and animation. Existing learning-based methods rely heavily on supervised training with high-quality 3D data and manual annotations, limiting scalability and diversity. To address this limitation, we propose PAWS, a method that directly extracts object articulations from hand-object interactions in large-scale in-the-wild egocentric videos. We evaluate our method on public datasets, including HD-EPIC and Arti4D, achieving significant improvements over baselines. We further demonstrate that the extracted articulations benefit downstream tasks, including fine-tuning 3D articulation prediction models and enabling robot manipulation. See the project website at https://aaltoml.github.io/PAWS/.

Figures (13)

  • Figure 1: PAWS: Articulation perception and localization from in-the-wild egocentric videos. (a) From raw videos of human interactions, (b) our method reconstructs the 3D scene and object articulations using hand cues, geometric recovery, and VLM reasoning. (c) These serve as annotations to improve downstream articulation prediction models via finetuning, while also providing 3D priors for real-world robotic manipulation.
  • Figure 2: Overall pipeline. Given a full in-the-wild egocentric video and a language description as input, our pipeline consists of four parts: (1) Dynamic Interaction Perception: We first segment the video based on the language description and extract interactive frames (referred to as "local views"), 3D hand trajectories, motion types, and coarse object localizations. (2) Geometric Structure Recovery: Based on the object's location, we select "global views" from the full video. Depending on the motion type, we recover the scene geometry using different flows. (3) VLM-guided Reasoning: The VLM first infers the motion type to provide a prior for global view selection, and then identifies plausible articulation axes during the geometry recovery stage. (4) Joint Articulation Inference: We integrate 3D hand trajectories and the recovered geometry to infer the final articulations.
  • Figure 3: Illustration of VLM Reasoning. (a) Temporal Motion Type Classification. (b) Spatial Axis Grounding via Set-of-Marks VQA.
  • Figure 4: PAWS for robot manipulation. Left: Spot closes the cupboard. Right: Spot opens the drawer. Insets show egocentric videos of hand-object interactions and the reconstructed 3D articulations.
  • Figure A5: Illustration of the hand filtering pipeline. Starting from noisy MANO fingertip observations $\mathbf{z}_t$, we apply contact-based trimming, forward Kalman filtering with $\chi^2$ outlier rejection, and RTS smoothing to obtain the refined trajectories $\{\hat{\mathbf{p}}_t\}$ used for articulation parameter estimation (a minimal code sketch of these filtering stages follows this list).
  • ...and 8 more figures
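
The filtering and smoothing stages described in the Figure A5 caption can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a constant-velocity state-space model over 3D fingertip positions, illustrative noise covariances and a 99% $\chi^2$ innovation gate, and a hypothetical helper name smooth_fingertips; the contact-based trimming step is omitted.

# Hedged sketch of the Figure A5 filtering stages (not the paper's code):
# constant-velocity Kalman filter over 3D fingertip positions with a
# chi-squared innovation gate, followed by a Rauch-Tung-Striebel smoother.
# The motion model, noise covariances, and gate level are assumptions.
import numpy as np
from scipy.stats import chi2

def smooth_fingertips(z, dt=1/30, sigma_a=5.0, sigma_z=0.01):
    """z: (T, 3) noisy fingertip observations; returns (T, 3) smoothed positions."""
    T = len(z)
    # State x = [position, velocity] in R^6 (constant-velocity assumption).
    F = np.eye(6); F[:3, 3:] = dt * np.eye(3)
    H = np.hstack([np.eye(3), np.zeros((3, 3))])
    Q = sigma_a**2 * np.diag([dt**4 / 4] * 3 + [dt**2] * 3)  # process noise (assumed)
    R = sigma_z**2 * np.eye(3)                               # measurement noise (assumed)
    gate = chi2.ppf(0.99, df=3)                              # 99% chi^2 gate (assumed)

    x = np.hstack([z[0], np.zeros(3)]); P = np.eye(6)
    xs_f, Ps_f, xs_p, Ps_p = [], [], [], []
    for t in range(T):
        # Predict one step ahead with the motion model.
        x_p = F @ x; P_p = F @ P @ F.T + Q
        xs_p.append(x_p); Ps_p.append(P_p)
        # Gate: reject observations whose Mahalanobis distance exceeds the threshold.
        y = z[t] - H @ x_p
        S = H @ P_p @ H.T + R
        if y @ np.linalg.solve(S, y) <= gate:
            K = P_p @ H.T @ np.linalg.inv(S)                 # Kalman gain
            x = x_p + K @ y
            P = (np.eye(6) - K @ H) @ P_p
        else:                                                # outlier: keep the prediction
            x, P = x_p, P_p
        xs_f.append(x); Ps_f.append(P)

    # RTS backward pass: refine each filtered state with future information.
    xs = [xs_f[-1]]; Ps = [Ps_f[-1]]
    for t in range(T - 2, -1, -1):
        G = Ps_f[t] @ F.T @ np.linalg.inv(Ps_p[t + 1])       # smoother gain
        xs.insert(0, xs_f[t] + G @ (xs[0] - xs_p[t + 1]))
        Ps.insert(0, Ps_f[t] + G @ (Ps[0] - Ps_p[t + 1]) @ G.T)
    return np.array(xs)[:, :3]

# Example: smooth a synthetic noisy trajectory.
t = np.linspace(0, 1, 30)
z = np.stack([t, 0.2 * t, np.zeros_like(t)], axis=1) + 0.01 * np.random.randn(30, 3)
p_hat = smooth_fingertips(z)

The gate skips the measurement update whenever the innovation's Mahalanobis distance exceeds the threshold, so isolated outlier observations fall back to the motion-model prediction instead of corrupting the recovered trajectory.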