VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, Stefan Leutenegger

TL;DR

VidBot tackles the embodiment gap in robot manipulation by learning agent-agnostic 3D affordances from in-the-wild RGB-only human videos. It first reconstructs metric-scale 3D hand trajectories via an SfM-based pipeline augmented with a metric-depth foundation model, then employs a coarse-to-fine learning framework in which coarse predictions of contact and goal points guide a diffusion-based fine trajectory generator. Test-time cost guidance, including multi-goal conditioning and collision/normal constraints, improves trajectory plausibility and context awareness, enabling robust zero-shot transfer to new robots and environments. Experiments in simulation and on real robots show substantial improvements over baselines on 13 tasks and demonstrate practical applicability to downstream robot learning tasks, highlighting the approach’s scalability for leveraging everyday videos in robot learning.
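To make the metric-scale reconstruction step concrete, here is a minimal sketch of one common way to combine up-to-scale SfM depths with metric depth predictions. The median-ratio alignment, the function names `metric_scale` and `backproject`, and the camera intrinsics are all illustrative assumptions; the summary does not state how VidBot performs this alignment.

```python
from statistics import median


def metric_scale(sfm_depths, metric_depths, eps=1e-6):
    """Estimate one global scale mapping up-to-scale SfM depths to metres.

    Assumption: a robust median of per-pixel ratios; not necessarily
    the alignment used by the VidBot authors.
    """
    ratios = [
        m / s
        for s, m in zip(sfm_depths, metric_depths)
        if s > eps and m > eps  # skip invalid or missing depths
    ]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a metric depth to a 3D camera-frame point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical depths sampled at the same pixels from both sources.
sfm = [1.0, 2.0, 4.0, 0.0]      # up-to-scale SfM depths (last one invalid)
metric = [0.5, 1.0, 2.0, 3.0]   # metric depth predictions in metres
scale = metric_scale(sfm, metric)

# A hand keypoint at the principal point with SfM depth 2.0 (hypothetical intrinsics).
hand_point = backproject(320.0, 240.0, 2.0 * scale, 500.0, 500.0, 320.0, 240.0)
```

Applying the same scale to every frame of a clip keeps the lifted hand trajectory temporally consistent, which is the property the pipeline is described as targeting.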

Abstract

Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains: how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well? We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos. VidBot leverages a pipeline to extract explicit representations from these videos, namely 3D hand trajectories, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations agnostic to embodiments. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions from the pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
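The role of test-time constraints can be illustrated with a simplified sketch that scores candidate trajectories by a goal cost and a collision cost, then ranks them so that lower cost means higher priority. This is a deliberate simplification: it ranks already-sampled trajectories rather than steering the diffusion denoising process, and the cost definitions, weights, margin, and function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def goal_cost(traj, goals):
    """Distance from the final waypoint to the nearest of several candidate goals."""
    return min(math.dist(traj[-1], g) for g in goals)


def collision_cost(traj, obstacles, margin=0.05):
    """Sum of penetration depths for waypoints closer than `margin` to an obstacle point."""
    return sum(
        max(0.0, margin - math.dist(p, o))
        for p in traj
        for o in obstacles
    )


def rank_trajectories(trajs, goals, obstacles, w_goal=1.0, w_col=10.0):
    """Return (cost, index) pairs sorted so the lowest-cost trajectory comes first."""
    scored = [
        (w_goal * goal_cost(t, goals) + w_col * collision_cost(t, obstacles), i)
        for i, t in enumerate(trajs)
    ]
    return sorted(scored)


# Hypothetical scene: one goal point and one obstacle point between start and goal.
goals = [(1.0, 0.0, 0.0)]
obstacles = [(0.5, 0.0, 0.0)]
candidates = [
    [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)],  # passes through the obstacle
    [(0.0, 0.0, 0.0), (0.5, 0.2, 0.0), (1.0, 0.0, 0.0)],  # detours around it
]
ranking = rank_trajectories(candidates, goals, obstacles)
best_cost, best_index = ranking[0]
```

Taking the minimum over several goals is one simple way to express multi-goal conditioning in a cost: a trajectory is rewarded for reaching any plausible goal rather than a single fixed one.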

Paper Structure

This paper contains 20 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Example trajectories extracted from raw human videos.
  • Figure 2: Overview of our affordance learning model. The affordance model is factorized into a coarse stage and a fine stage. In the coarse stage, we parse high-level contact and goal configurations from task inputs. The supplementary material provides a more detailed illustration of the conditional feature extraction process. In the fine stage, we use the coarse-stage outputs to guide the fine-grained interaction trajectory generation process through conditioning and cost functions. The color represents the final cost value, with darker shades indicating lower costs.
  • Figure 3: (a) Average success rate for the visual goal-reaching task. (b) Average coincidental success for the exploration task.
  • Figure 4: Affordances predicted by VRB [bahl2023affordances] and by our model, given an instruction and an RGB-D image. Although both use the same RGB-only human videos for training, our framework predicts much more accurate contact points and interaction trajectories directly in 3D space, outperforming VRB [bahl2023affordances], whose pixel-space predictions are ambiguous. We visualize the top five affordance samples inferred by our model, where colors represent the final cost values; darker shades indicate lower costs and, therefore, a higher rank for the agent to execute.
  • Figure 5: Real-world robotic manipulation tasks with inferred affordance displayed in the top panels.