EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren

TL;DR

EnerVerse tackles the challenge of translating language instructions into robust robotic manipulation by modeling embodied 4D spaces. It introduces a chunk-wise diffusion model with sparse memory, a multi-view prior, and a 4DGS data flywheel that reduces the sim-to-real gap. The EnerVerse-A policy head maps predicted 4D futures to action chunks, achieving state-of-the-art results in both simulation and real-world tasks with efficient inference. Limitations include video artifacts and heuristic camera poses, pointing to future work in view planning and artifact reduction.

Abstract

We introduce EnerVerse, a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs a chunk-wise autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. To model the 3D robotics world, we adopt a multi-view video representation, providing rich perspectives to address challenges like motion ambiguity and 3D grounding. Additionally, EnerVerse-D, a data engine pipeline combining generative modeling with 4D Gaussian Splatting, forms a self-reinforcing data loop to reduce the sim-to-real gap. Leveraging these innovations, EnerVerse translates 4D world representations into physical actions via a policy head (EnerVerse-A), achieving state-of-the-art performance in both simulation and real-world tasks. For efficiency, EnerVerse-A reuses features from the first denoising step and predicts action chunks, achieving about 280 ms per 8-step action chunk on a single RTX 4090. Further video demos and dataset samples can be found on our project page.
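
The chunk-wise autoregressive rollout with a sparse context memory described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `Frame`, `sparse_context`, `rollout`, `predict_chunk`, and `is_eos` are hypothetical names, frames are plain integers rather than video latents, and the rule of keeping the latest frame plus a random time-ordered subset of earlier frames is an assumption made only for illustration.

```python
import random
from typing import Callable, List, Sequence

Frame = int  # hypothetical stand-in for a latent frame (a real model would use tensors)


def sparse_context(history: Sequence[Frame], k: int, rng: random.Random) -> List[Frame]:
    """Keep at most k past frames: the latest one plus a random, time-ordered subset.

    The sampling rule is an assumption for this sketch; k must be at least 1.
    """
    if len(history) <= k:
        return list(history)
    earlier = sorted(rng.sample(range(len(history) - 1), k - 1))
    return [history[i] for i in earlier] + [history[-1]]


def rollout(
    first_frame: Frame,
    predict_chunk: Callable[[List[Frame]], List[Frame]],
    is_eos: Callable[[Frame], bool],
    k: int = 4,
    max_chunks: int = 50,
    seed: int = 0,
) -> List[Frame]:
    """Generate chunk by chunk, feeding generated frames back as context until EoS."""
    rng = random.Random(seed)
    history = [first_frame]
    for _ in range(max_chunks):
        context = sparse_context(history, k, rng)
        for frame in predict_chunk(context):
            history.append(frame)
            if is_eos(frame):
                return history  # stop as soon as the end-of-sequence frame appears
    return history  # safety stop if no EoS frame was produced


def toy_predict(context: List[Frame]) -> List[Frame]:
    """Toy 'generator': continue counting from the most recent context frame."""
    return [context[-1] + step for step in range(1, 5)]


frames = rollout(0, toy_predict, is_eos=lambda f: f >= 10)
```

In this toy run each chunk holds four frames, and the rollout stops partway through the third chunk once the frame that satisfies `is_eos` has been appended, which mirrors how generation is described as iterating until an end-of-sequence frame is detected.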

Paper Structure

This paper contains 20 sections, 2 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: An overview of EnerVerse. With camera observations, we first obtain a 3D reconstruction via depth warping and then render images from multiple views. EnerVerse (a) connects to a video generator head (EnerVerse-G) to produce multi-view videos, (b) attaches to a robotic action policy head (EnerVerse-A) for action prediction, and (c) integrates with 4DGS to form a data flywheel (EnerVerse-D) for Sim2Real.
  • Figure 2: An overview of our chunk-wise autoregressive generation approach and multi-view diffusion generator block. (a) During training, random clean frames from consecutive sequences are combined with noisy frames to predict denoised latents. In inference, newly generated denoised frames become the next clean frames for subsequent steps, iterating until the EoS frame is detected. Only a single view of the autoregressive process is shown for clarity. (b) In the multi-view diffusion generator block, observational frames from Camera $i$ or Rendered View $i+1$ are encoded with a VAE. Ray direction maps are concatenated with video latents, followed by conv layers and attention mechanisms.
  • Figure 3: The pipeline for EnerVerse as a data engine. Observation images from multiple cameras and rendered images are processed by the multi-view video generator to produce denoised videos. These videos, along with their camera poses, are used in 4DGS for 4D scene reconstruction. The reconstructed 3D content is rendered to generate high-precision images. These high-quality rendered images are iteratively refined and fed back into the pipeline.
  • Figure 4: Render View 1 and Render View 2 are generated by rendering from a point cloud reconstructed from RGB-Image 1 using depth warping. The render views correspond to camera views obtained by rotating the RGB camera view around the Z-axis by $\pm 30^\circ$.
  • Figure 5: Qualitative comparison of single-view video generation between EnerVerse and DynamiCrafter(FN) on the RT-1 dataset. Since EnerVerse predicts the EOS frame at the 42nd frame for this task, we visualize frames sampled up to the 42nd frame from both generated sequences. The sequence generated by DynamiCrafter(FN) loses logical consistency and produces many hallucinations as it grows longer. In contrast, the sequence generated by EnerVerse is logically coherent, continuously and completely generates the future space of the entire task, and accurately predicts the EOS frame.
  • ...and 9 more figures
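
The Figure 4 caption describes render views obtained by rotating the RGB camera view around the Z-axis by $\pm 30^\circ$. A minimal sketch of that rotation follows. The caption does not say which frame's Z-axis is meant or how camera poses are stored, so the choices here (a 3x3 camera-to-world rotation, rotation about the world Z-axis by pre-multiplication) and the names `rot_z`, `matmul`, and `rotated_views` are assumptions for illustration only.

```python
import math
from typing import List

Matrix = List[List[float]]


def rot_z(degrees: float) -> Matrix:
    """Standard 3x3 rotation matrix about the Z-axis."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rotated_views(base_rotation: Matrix, angle_deg: float = 30.0) -> List[Matrix]:
    """Return the base orientation rotated by +angle and -angle about the world Z-axis.

    Pre-multiplying rotates about the world axis; this convention is an assumption.
    """
    return [matmul(rot_z(sign * angle_deg), base_rotation) for sign in (1.0, -1.0)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view_plus, view_minus = rotated_views(identity)
```

Because the two views use opposite angles about the same axis, composing them cancels out, which gives a quick sanity check on the sign convention.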