Towards Generalist Robot Learning from Internet Video: A Survey

Robert McCarthy, Daniel C. H. Tan, Dominik Schmidt, Fernando Acero, Nathan Herr, Yilun Du, Thomas G. Thuruthel, Zhibin Li

TL;DR

The paper addresses the data bottleneck in achieving generalist robots by surveying Learning from Videos (LfV) methods, which leverage vast internet video. It outlines how video foundation models and scalable representations can extract robotics-relevant knowledge to train policies and dynamics models, even when action labels and low-level data are missing. The authors categorize methods by reinforcement learning (RL) knowledge modality, discuss datasets and benchmarks, and critically analyze challenges such as distribution shift and computational demands, offering concrete recommendations for future work. Overall, the survey argues that scalable LfV, especially via video foundation models and action representations, can significantly advance general-purpose robotic capabilities, while highlighting open questions and the need for robust evaluation frameworks.

Abstract

Scaling deep learning to massive and diverse internet data has driven remarkable breakthroughs in domains such as video generation and natural language processing. Robot learning, however, has thus far failed to replicate this success and remains constrained by a scarcity of available data. Learning from videos (LfV) methods aim to address this data bottleneck by augmenting traditional robot data with large-scale internet video. This video data provides foundational information regarding physical dynamics, behaviours, and tasks, and can be highly informative for general-purpose robots. This survey systematically examines the emerging field of LfV. We first outline essential concepts, including detailing fundamental LfV challenges such as distribution shift and missing action labels in video data. Next, we comprehensively review current methods for extracting knowledge from large-scale internet video, overcoming LfV challenges, and improving robot learning through video-informed training. The survey concludes with a critical discussion of future opportunities. Here, we emphasize the need for scalable foundation model approaches that can leverage the full range of available internet video and enhance the learning of robot policies and dynamics models. Overall, the survey aims to inform and catalyse future LfV research, driving progress towards general-purpose robots.

Paper Structure (119 sections, 1 equation, 8 figures, 1 table)

This paper contains 119 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: An overview of the key concepts and taxonomies in this survey. The top green box presents the high-level motivation behind LfV. The middle orange boxes highlight the benefits (Section \ref{sec:benefits}) and challenges (Section \ref{sec:challenges}) of LfV. The bottom blue box visualises possible components in a pipeline for learning from large-scale internet video, as per the taxonomies presented in the survey. Large internet video datasets (Section \ref{sec:datasets}) can be used to pretrain (video) foundation models (Section \ref{sec:video_FMs}). These models can be adapted (e.g., via zero-shot transfer or finetuning) into reinforcement learning (RL) 'knowledge modalities' \cite{wulfmeier2023foundations} for use in the robot domain (Section \ref{sec:applications}). The diagram additionally highlights that action representations (Section \ref{sec:alt_actions}) can be used to mitigate the issue of missing action labels in video.
  • Figure 2: Generalization in the Learning from Videos (LfV) setting. The x-axis indicates the range of behaviours expected from a generalist robot. The y-axis indicates the 'levels' of information contained in data. The figure demonstrates that internet data has better coverage over desired behaviours than narrow robot datasets, but lacks crucial low-level information essential to robotics. Generalising beyond the robot data despite this missing low-level information is a key LfV challenge. See Sections \ref{sec:benefits} and \ref{sec:challenges} for further discussion.
  • Figure 3: Key challenges in LfV (see Section \ref{sec:challenges}) are visualised, including: missing (action and low-level) information in video, LfV distribution shifts, and the high-dimensional nature of video data.
  • Figure 4: Video Foundation Modelling for LfV (see Section \ref{sec:video_FMs}). The top green box outlines different categories of video foundation models and their applications to robotics. The bottom blue boxes illustrate ways video foundation models can contribute to LfV: (left) pretrained video foundation models can be finetuned into robot foundation models; (right) video foundation model techniques and datasets can be used to train robot foundation models.
  • Figure 5: Recovering action representations from video to overcome the missing action label problem in LfV (Section \ref{sec:alt_actions}).
  • ...and 3 more figures