On the Evaluation of Generative Robotic Simulations

Feng Chen, Botian Xu, Pu Hua, Peiqi Duan, Yanchao Yang, Yi Ma, Huazhe Xu

TL;DR

The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics.

Abstract

Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capacities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For single-task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision-language models. In terms of diversity, we measure both task and data diversity through text similarity of task descriptions and world model loss trained on collected task trajectories. For task-level generalization, we assess the zero-shot generalization ability on unseen tasks of a policy trained with multiple generated tasks. Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics. Additionally, our analysis further highlights the common challenge of low generalization capability faced by current works. Our anonymous website: https://sites.google.com/view/evaltasks.
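The task-diversity idea in the abstract (text similarity of task descriptions) can be illustrated with a small sketch. Everything concrete here is an assumption for illustration: the abstract does not say which text representation is used, so a plain bag-of-words cosine similarity stands in for it, and the function names `bag_of_words`, `cosine_similarity`, and `task_diversity`, as well as the choice to report diversity as one minus the mean pairwise similarity, are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description shares nothing with any other
    return dot / (norm_a * norm_b)


def task_diversity(descriptions):
    """Assumed diversity score: 1 minus the mean pairwise similarity."""
    vectors = [bag_of_words(d) for d in descriptions]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two task descriptions")
    mean_similarity = sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


if __name__ == "__main__":
    tasks = [
        "open the drawer",
        "open the cabinet door",
        "stack two blocks",
    ]
    print(task_diversity(tasks))
```

Under this stand-in, a set of near-identical descriptions scores close to 0 and a set with no shared words scores 1. A learned sentence embedding would capture paraphrases that word counts miss.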

Paper Structure

This paper contains 28 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We propose three main aspects for evaluating generative simulations: Quality, Diversity, and Generalization. Quality encompasses two components: the alignment of the task scene with the real world, and the completion score, which assesses if the robot’s trajectory solves the task. Diversity is divided into two components as well: text-based diversity of task descriptions, and dynamics-based diversity among trajectory data. Generalization involves assessing the data's generalization ability using a representative imitation learning model.
  • Figure 2: Overview of our evaluation framework. The evaluation is divided into three parts. We first use an LLM and a VLM to evaluate scene alignment and task completion for the generated tasks. These tasks are then grouped and assessed on two fronts: task diversity, gauged by the textual similarity of task descriptions, and data diversity, measured by the prediction errors of a world model. Finally, we assess the generalization capability of a policy trained on the generated data.
  • Figure 3: Pearson correlation divided by mean absolute error between each method's scores and human evaluations on different datasets. In the bar chart, higher values indicate that the model's results are closer to human evaluations, while negative values indicate that the model's output is negatively correlated with human evaluations. We truncate the negative bars for better visualization.
  • Figure 4: Single-task evaluation results. The "-P" flag refers to the published tasks of a given method, while the "-G" flag refers to tasks generated by running the released code. The size of each data marker represents the variance of the evaluation results under the corresponding setting.
  • Figure 5: Single-task evaluation examples on three different tasks from different generative simulation pipelines. The first row displays a task that achieves high scores in both scene alignment and task completion. The second row illustrates a task with low scene alignment, while the third row presents a task with low task completion.
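
The agreement score described in the Figure 3 caption (Pearson correlation divided by mean absolute error against human evaluation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the score lists, and the error handling for degenerate inputs are assumptions, and the caption does not say how scores are scaled before comparison.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (std_x * std_y)


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def agreement_score(model_scores, human_scores):
    """Pearson correlation divided by MAE; higher means closer to humans."""
    if len(model_scores) != len(human_scores) or len(model_scores) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mae = mean_absolute_error(model_scores, human_scores)
    if mae == 0:
        raise ValueError("MAE is zero; the ratio is undefined")
    return pearson(model_scores, human_scores) / mae


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    model = [1.0, 2.0, 3.0, 4.0]
    human = [2.0, 3.0, 4.0, 5.0]
    print(agreement_score(model, human))
```

Dividing by the error rewards evaluators that both rank tasks like humans do and assign similar absolute scores. A negatively correlated evaluator yields a negative ratio, which matches the truncated negative bars mentioned in the caption.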