MuBlE: MuJoCo and Blender simulation Environment and Benchmark for Task Planning in Robot Manipulation

Michal Nazarczuk, Karla Stepanova, Jan Kristof Behrens, Matej Hoffmann, Krystian Mikolajczyk

TL;DR

MuBlE addresses the challenge of developing embodied reasoning agents for long-horizon robot manipulation by providing a MuJoCo-based physics engine coupled with Blender-based photorealistic rendering within robosuite. It introduces SHOP-VRB2, a 12,000-scene multimodal benchmark demanding simultaneous visual and physical reasoning across ten tasks, plus data-generation tools for scenes, instructions, and ground-truth annotations. The authors demonstrate baselines on SHOP-VRB2 and real-world YCB scenes, showing meaningful sim-to-real transfer aided by high-fidelity rendering and accurate physics, while highlighting current difficulties in long-horizon manipulation. This work offers a scalable framework and benchmark to foster advances in closed-loop planning and multimodal understanding for robot manipulation, with practical impact on sim-to-real transfer and evaluation of embodied reasoning systems.

Abstract

Current embodied reasoning agents struggle to plan for long-horizon tasks that require physical interaction with the world to obtain the necessary information (e.g. 'sort the objects from lightest to heaviest'). Improving the capabilities of such agents depends heavily on the availability of relevant training environments. To facilitate the development of such systems, we introduce a novel simulation environment (built on top of robosuite) that uses the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. It is the first simulator focusing on long-horizon robot manipulation tasks while preserving accurate physics modeling. MuBlE can generate multimodal data for training and enables the design of closed-loop methods through environment interaction on two levels: a visual-action loop and a control-physics loop. Together with the simulator, we propose SHOP-VRB2, a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements.

Paper Structure

This paper contains 15 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An example task from the proposed SHOP-VRB2 benchmark presenting capabilities of the proposed MuBlE environment: synthetic scene and instruction generation, execution of symbolic actions for manipulation followed by physics calculation and realistic rendering. Symbolic actions and their corresponding targets are marked in the image.
  • Figure 2: A diagram showing individual modules of the MuBlE environment, including how a reasoning approach might be integrated within the MuBlE environment. Example instruction and scene from SHOP-VRB2 benchmark are shown. Symbols for transferred data: $\mathtt{T}$ - query text, $\mathtt{I}$ - image, $\mathtt{G}$ - scene graph, $\mathtt{P}$ - physical observations, $\mathtt{C}$ - control signal, $\mathtt{A}$ - primitive action to take, $\mathtt{R}$ - returned result, $\mathtt{GT}$ - ground truth data.
  • Figure 3: An example of the interaction between MuBlE (in yellow) and a reasoning method (in orange). Figure presents selected measurements $\mathtt{P}$ and primitive actions $\mathtt{A}$ generated based on them, followed by a corresponding update of the scene in the environment.
  • Figure 4: Examples from SHOP-VRB2: simulated scenes and corresponding natural-language instructions generated with MuBlE (in the dataset, instructions left to right belong to tasks 7, 3, and 1 in Tab. \ref{tab:templates}).
  • Figure 5: Examples of visual observations (selected frames) generated for the actions corresponding to the instruction 'Stack metal objects from heaviest to lightest'. (Left) Simulated YCB scenes rendered by Blender in MuBlE and (Right) corresponding real YCB scenes captured by a RealSense camera during the real-world experiment, using the reasoning method pretrained in the MuBlE environment on the simulated SHOP-VRB2 dataset.
  • ...and 1 more figure