From Single Images to Motion Policies via Video-Generation Environment Representations

Weiming Zhi, Ziyong Ma, Tianyi Zhang, Matthew Johnson-Roberson

TL;DR

The paper presents VGER, a framework that converts a single RGB image into a dense, geometry-faithful environment representation by conditioning a pretrained video generator to synthesize multi-view frames and fusing them with a 3D foundation model. It then constructs an implicit unsigned distance field via a multi-scale noise-contrastive approach and integrates this field into a metric-modulated motion policy grounded in a Riemannian-like formulation, enabling collision-free motion from a single image. The method demonstrates improved geometric reconstruction over monocular-depth baselines and yields smoother, safer trajectories in diverse environments, highlighting practical implications for data-efficient robot planning. Overall, VGER bridges perception and reactive motion generation from minimal input, with potential impact on safe robot deployment in open, unstructured settings.

Abstract

Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
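The abstract mentions training an implicit representation with a multi-scale noise approach but gives no details here. As a rough illustration only, the sketch below shows one plausible way to build training data for an unsigned distance field from a point cloud: perturb the points with Gaussian noise at several scales and label each perturbed query with its distance to the nearest original point. The function name `multiscale_noise_samples`, the noise scales, and the brute-force nearest-neighbour search are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def multiscale_noise_samples(points, scales=(0.01, 0.05, 0.2), seed=0):
    """Hypothetical helper: build (query, unsigned distance) training pairs.

    points: (N, 3) array of surface points, e.g. a fused point cloud.
    scales: assumed Gaussian noise standard deviations, one per scale.
    Returns queries of shape (len(scales) * N, 3) and distances of shape
    (len(scales) * N,).
    """
    rng = np.random.default_rng(seed)
    # Small scales sample near the surface; large scales sample free space.
    queries = np.concatenate(
        [points + rng.normal(scale=sigma, size=points.shape) for sigma in scales],
        axis=0,
    )
    # Brute-force nearest neighbour: fine for a sketch, O(M * N) memory.
    diffs = queries[:, None, :] - points[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return queries, distances


if __name__ == "__main__":
    cloud = np.random.default_rng(1).uniform(-1.0, 1.0, size=(200, 3))
    q, d = multiscale_noise_samples(cloud)
    print(q.shape, d.shape, float(d.min()), float(d.max()))
```

The resulting pairs could then be regressed by any small network to obtain a continuous distance field; for large point clouds, a spatial index such as a k-d tree would replace the brute-force search.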

Paper Structure

This paper contains 14 sections, 13 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: (a): Single original image used to construct the environment; (b): Predictions by DepthAnything-V2 [depthanything_v2], with the predicted depth image on the left and the extracted structure on the right. We observe frustum-shaped "tails" in the extracted structure. These artifacts in free space make it impossible to use the representation for motion generation; (c): Our proposed VGER does not suffer from these errors, and completes regions occluded in the original view.
  • Figure 2: Pipeline of VGER.
  • Figure 3: Examples of extracting a 3D structure of an outdoor bench (top) and an indoor office environment (bottom) from input images. Input images are shown in subfigure (a). We leverage a video generator, conditioned on the images, to generate videos, with frames shown in subfigure (b). These frames are then used to construct 3D structures via foundation models, without any frustum-shaped artifacts, shown in subfigure (c).
  • Figure 4: We leverage the 3D foundation model DUSt3R [DUSt3R_cvpr24], which can produce 3D structures from sets of 2D images, and filter the output based on confidence maps.
  • Figure 5: With a single example input image, shown in (a), of a stone model in an indoor environment, VGER can build a 3D representation of the scene. It does not suffer from incomplete surfaces like the results from DepthAnything-V2, shown in (b). It can also facilitate downstream motion trajectory generation. Two trajectories, colored in red and blue, smoothly avoiding the obstacle, are shown in (c).
  • ...and 10 more figures
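
The Figure 4 caption mentions filtering the predicted 3D structure using confidence maps. A minimal sketch of that idea follows, assuming each view provides a per-pixel point map of shape (H, W, 3) and a confidence map of shape (H, W). The function name `filter_by_confidence` and the threshold value are hypothetical, and the code does not use or reproduce the DUSt3R API.

```python
import numpy as np


def filter_by_confidence(point_maps, conf_maps, threshold=3.0):
    """Hypothetical helper: keep only confidently predicted 3D points.

    point_maps: list of (H, W, 3) arrays, one per view.
    conf_maps:  list of (H, W) arrays with per-pixel confidence.
    threshold:  assumed cut-off; pixels at or below it are discarded.
    Returns a single (K, 3) array with the surviving points of all views.
    """
    kept = []
    for points, conf in zip(point_maps, conf_maps):
        mask = conf > threshold      # boolean (H, W) mask
        kept.append(points[mask])    # boolean indexing yields (k, 3)
    if not kept:
        return np.empty((0, 3))
    return np.concatenate(kept, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = [rng.normal(size=(4, 5, 3)) for _ in range(2)]
    conf = [rng.uniform(0.0, 6.0, size=(4, 5)) for _ in range(2)]
    print(filter_by_confidence(pts, conf).shape)
```

Raising the threshold trades coverage for fewer spurious points in free space, which matters when the point cloud feeds a collision-avoiding motion policy.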