Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation

Chrisantus Eze, Christopher Crick

TL;DR

The survey investigates how robots can acquire manipulation skills by passively watching online videos, addressing the data scarcity and dataset bias that hinder traditional robot learning. It covers foundations in video representations, affordances, 3D hand modeling, and datasets, then analyzes learning approaches spanning reinforcement learning, imitation learning, hybrids, and multi-modal/foundation-model paradigms. It highlights benchmarks (RLBench, CALVIN, RoboTube) and open-source tooling, outlines challenges in data, domain shift, computation, and evaluation, and advocates causal reasoning as a future direction. The work underscores that video-based learning offers scalable, generalizable supervision to advance vision-based robot manipulation toward real-world deployment.

Abstract

Robot learning of manipulation skills is hindered by the scarcity of diverse, unbiased datasets. While curated datasets can help, challenges remain in generalizability and real-world transfer. Meanwhile, large-scale "in-the-wild" video datasets have driven progress in computer vision through self-supervised techniques. Translating this to robotics, recent works have explored learning manipulation skills by passively watching abundant videos sourced online. Showing promising results, such video-based learning paradigms provide scalable supervision while reducing dataset bias. This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources, as well as emerging techniques for acquiring robot manipulation skills from uncontrolled video demonstrations. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation. The survey summarizes video-based learning approaches, analyzes their benefits over standard datasets, surveys metrics and benchmarks, and discusses open challenges and future directions in this nascent domain at the intersection of computer vision, natural language processing, and robot learning.

Paper Structure (54 sections, 7 figures, 20 tables)

Figures (7)

  • Figure 1: Left: Advanced Action Modeling. This group of models uses generative models for action representation. Right: Spatial and Embodied Reasoning. This group of works goes beyond basic visual inputs by incorporating a deeper understanding of 3D space and physical relationships (relevant works: SpatialVLA and Gemini Robotics).
  • Figure 2: Categorization of the models into three groups based on their core design philosophy, giving a high-level view of how each model approaches VLA modeling. Left: directly adapts a pre-trained VLM for robotic control (relevant works: OpenVLA, Octo, Gemini Robotics, SpatialVLA). Middle: separates high-level reasoning from low-level action generation (relevant works: CogACT, GROOT N1, CoT-VLA). Right: pre-trains on massive datasets of non-robotic videos to learn the underlying dynamics of the world, a form of "embodied physics," before specializing for robot control (relevant works: GR-2).
  • Figure 3: A comprehensive timeline organized chronologically by publication year, highlighting the key breakthroughs and milestones each work introduced to the field.
  • Figure 4: Feature extraction methods in video-based robot learning. CNN-based pipelines extract object features and masks, while pose/keypoint methods capture skeletal motion cues, both providing intermediate representations for policy learning.
  • Figure 5: Reinforcement learning paradigms in video-based robot learning. Left: visual RL with feature extraction, grounding policies in parsed video features. Right: structured and hierarchical RL, learning high-level video embeddings for multiple tasks and decomposing long-horizon tasks into subtasks and primitive skills.
  • ...and 2 more figures