VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

TL;DR

VideoWorld investigates learning high-level knowledge purely from unlabeled video by combining a VQ-VAE auto-encoder with an autoregressive transformer and a Latent Dynamics Model (LDM) that compresses multi-step visual changes into latent codes. The approach enables learning Go rules and robotic control without text instructions or reward signals, achieving a 5-dan level on Video-GoBench with only 300M parameters and approaching oracle performance on CALVIN and RLBench with strong cross-environment generalization. LDM provides a compact, temporally aware representation that enhances planning and learning efficiency, and ablations reveal the importance of latent codes and horizon length for long-horizon tasks. The work demonstrates the potential of video-only knowledge learners and releases code, data, and models to accelerate subsequent research in vision-based knowledge acquisition.

Abstract

This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.

Paper Structure (22 sections, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: VideoWorld explores learning knowledge from unlabeled videos, ranging from task-specific rules to high-level reasoning and planning capabilities. Compared to other learning methods: reinforcement learning (RL), supervised learning (SL) and text-based learning, it offers three advantages: 1) better generalization with unified visual representation for various tasks and interfaces, 2) lower manual annotation burden, and 3) learning richer real-world information than text description.
  • Figure 2: Comparison of prediction targets. "State", "Video" and "Video w/ LDM" refer to three different prediction targets: a state sequence (e.g., labeled positions of moves in Go), a raw video sequence, and a video sequence augmented with latent codes representing future visual changes (the approach adopted by VideoWorld). "Action-Value" denotes the score for each move in the game, with details provided in Sec. \ref{subsec:goeva}. By combining rich video information with a compact representation of visual changes, VideoWorld enables more effective learning.
  • Figure 3: Overview of the proposed VideoWorld model architecture. (Left) Overall architecture. (Right) The proposed latent dynamics model (LDM). First, LDM compresses the visual changes from each frame to its subsequent $H$ frames into a set of latent codes. Then, an auto-regressive transformer seamlessly integrates the output of LDM with the next token prediction paradigm.
  • Figure 4: UMAP projection \cite{leland2018umap} of the learned latent code on the Go (left) and CALVIN (right) training sets. Each point represents the continuous (pre-quantization) latent code generated by the LDM. In the Go examples, odd steps represent white's moves, and even steps represent black's moves. We visualize the latent codes of black moves in steps 2/4/6. The legend shows examples of common patterns learned for new black moves. For clarity, these moves are highlighted on the board with added colors and lines to indicate new patterns. On the right, we visualize the latent codes of the robotic arm's movement along the X/Y/Z axes at intervals of 1, 5, and 10 frames. Points are color-coded by displacement range, with purple and red indicating the maximum displacement in opposite directions along each axis.
  • Figure 5: Illustration of playing against KataGo and UMAP projection \cite{leland2018umap} of the predicted latent code. Our model plays as black. The generated latent code is visualized through the LDM decoder, and new stones in the visualization are marked with colors to match the legend. The visualization serves as a probe, indicating that the model shows signs of forward planning.
  • ...and 5 more figures
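
The Figure 3 caption describes two steps: the LDM summarizes the visual change from each frame to its subsequent $H$ frames as latent codes, and an auto-regressive transformer consumes those codes within ordinary next-token prediction. The sketch below illustrates only the sequence bookkeeping this implies. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `horizon_windows` and `interleave`, the placement of each frame's codes directly after that frame's tokens, and the `code_offset` used to keep code ids apart from frame-token ids are all hypothetical.

```python
from typing import List, Sequence


def horizon_windows(num_frames: int, horizon: int) -> List[List[int]]:
    """For each frame t, list the indices of its (up to) `horizon` subsequent frames.

    These are the frames whose visual change relative to frame t would be
    summarized by that frame's latent codes. Windows are clipped at the
    end of the clip, so the last frame has an empty window.
    """
    last = num_frames - 1
    return [list(range(t + 1, min(t + horizon, last) + 1)) for t in range(num_frames)]


def interleave(
    frame_tokens: Sequence[Sequence[int]],
    latent_codes: Sequence[Sequence[int]],
    code_offset: int,
) -> List[int]:
    """Build one flat token sequence for next-token prediction.

    Assumption (not taken from the source): the codes for frame t are
    placed right after the tokens of frame t, and code ids are shifted by
    `code_offset` so they do not collide with frame-token ids.
    """
    if len(frame_tokens) != len(latent_codes):
        raise ValueError("need exactly one group of latent codes per frame")
    sequence: List[int] = []
    for frame, codes in zip(frame_tokens, latent_codes):
        sequence.extend(frame)
        sequence.extend(code_offset + c for c in codes)
    return sequence


if __name__ == "__main__":
    print(horizon_windows(num_frames=4, horizon=2))
    print(interleave([[1, 2], [3, 4]], [[0], [1]], code_offset=100))
```

With four frames and $H = 2$, the first frame's codes would cover frames 1 and 2, while the final frame has no future frames left to summarize. How the actual model orders and embeds the two token types is not specified in the text above.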