
Robot Data Curation with Mutual Information Estimators

Joey Hejna, Suvir Mirchandani, Ashwin Balakrishna, Annie Xie, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, Dhruv Shah, Coline Devin, Dorsa Sadigh

TL;DR

This work targets the data-quality problem in robot imitation learning by introducing Demonstration Information Estimation (DemInf), an unsupervised method that scores individual demonstrations by their contribution to the dataset's mutual information (MI) between states and actions. DemInf uses variational autoencoders (VAEs) to embed states and actions into latent spaces, then applies the non-parametric $k$-NN estimator of Kraskov, Stögbauer, and Grassberger (KSG) to compute per-sample MI contributions, which are aggregated into trajectory-level scores for data filtering. Across RoboMimic, RoboCrowd, and Franka datasets, DemInf outperforms baselines and other MI estimators; behavior cloning (BC) policies trained on DemInf-filtered data improve by 5–10% in RoboMimic and perform better on real-world tasks. The approach requires no hand annotations and offers a practical way to improve robot imitation learning in data-limited settings. The authors leave limitations related to dynamics and online deployment to future work.

Abstract

The performance of imitation learning policies often hinges on the datasets with which they are trained. Consequently, investment in data collection for robotics has grown across both industrial and academic labs. However, despite the marked increase in the quantity of demonstrations collected, little work has sought to assess the quality of that data, even though evidence of its importance is mounting in other areas such as vision and language. In this work, we take a critical step towards addressing data quality in robotics. Given a dataset of demonstrations, we aim to estimate the relative quality of individual demonstrations in terms of both action diversity and predictability. To do so, we estimate the average contribution of a trajectory towards the mutual information between states and actions in the entire dataset, which captures both the entropy of the marginal action distribution and the state-conditioned action entropy. Though commonly used mutual information estimators require vast amounts of data, often beyond the scale available in robotics, we introduce a novel technique based on $k$-nearest-neighbor estimates of mutual information on top of simple VAE embeddings of states and actions. Empirically, we demonstrate that our approach is able to partition demonstration datasets by quality according to human expert scores across a diverse set of benchmarks spanning simulation and real-world environments. Moreover, training policies on data filtered by our method leads to a 5-10% improvement in RoboMimic and better performance on real ALOHA and Franka setups.
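
The link between mutual information and the two quality criteria named in the abstract follows from the standard entropy decomposition. It is written out here as a reminder; the symbols $S$ and $A$ for the state and action random variables are notation assumed for this note and may differ from the paper's own.

$$
I(S; A) = H(A) - H(A \mid S)
$$

A large marginal action entropy $H(A)$ corresponds to action diversity, and a small conditional entropy $H(A \mid S)$ corresponds to actions that are predictable from the state, so a demonstration that raises $I(S; A)$ is favorable on at least one of the two criteria.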

Paper Structure

This paper contains 28 sections, 23 equations, 21 figures, 1 table.

Figures (21)

  • Figure 1: A graphical depiction of the DemInf method. First, we begin by learning VAEs for states and action chunks to produce latent representations $z_a$ and $z_s$. Using these latent representations, we apply the KSG $k$-nearest-neighbor based mutual information estimator. Finally, we filter demonstrations based on their estimated mutual information.
  • Figure 2: The average estimated $\hat{I}(s;a)$ per timestep for high quality data ("better" demonstrators) in "Square MH" from RoboMimic. Notice that at the start of the trajectory and after the grasp (75-100 steps), $\hat{I}$ is highest, while it is low during the grasp period (50-75 steps).
  • Figure 3: Visualization of the tasks represented in the datasets we use in this work, including the Can MH, Lift MH, and Square MH datasets from RoboMimic; real-world PenInCup and DishRack datasets collected on a Franka robot; and the real-world TootsieRoll, HiChew, and HersheyKiss datasets from RoboCrowd for the ALOHA robot.
  • Figure 4: Average quality of demonstrations remaining in datasets after filtering with different choices of $S$ on the Lift, Can, and Square Multi-Human (MH) datasets from the RoboMimic benchmark with states (left) and images (right). Results are shown as an average of 3 seeds.
  • Figure 5: Average quality of demonstrations remaining in datasets after filtering with different choices of $S$ on the Hi-Chew, Tootsie-Roll, and Hershey-Kiss crowdsourced datasets from the RoboCrowd benchmark. We include results for datasets with a combination of expert and only task-relevant data (left), and a version of the data that contains additional unstructured play data (right). Results are shown as an average of 3 seeds.
  • ...and 16 more figures
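
The pipeline described in the Figure 1 caption (latent embeddings, KSG estimation, filtering) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: random arrays stand in for the VAE latents $z_s$ and $z_a$, the estimator is the standard KSG form $\psi(k) + \psi(N) - \psi(n_s + 1) - \psi(n_a + 1)$ evaluated per sample with brute-force max-norm distances, and the helper names `digamma_int`, `ksg_per_sample`, `score_trajectories`, and `filter_top_fraction` are hypothetical. The choice of `k` and of the kept fraction are likewise assumptions.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at positive integers: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    n = np.asarray(n, dtype=int)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, n.max()))])
    return -EULER_GAMMA + harmonic[n - 1]


def ksg_per_sample(z_s, z_a, k=3):
    """Per-sample KSG contributions; their mean is the usual KSG MI estimate."""
    n = len(z_s)
    # Pairwise max-norm (Chebyshev) distances in each latent space.
    d_s = np.abs(z_s[:, None, :] - z_s[None, :, :]).max(axis=-1)
    d_a = np.abs(z_a[:, None, :] - z_a[None, :, :]).max(axis=-1)
    # Exclude each point from its own neighborhood.
    np.fill_diagonal(d_s, np.inf)
    np.fill_diagonal(d_a, np.inf)
    d_joint = np.maximum(d_s, d_a)
    # Distance from each point to its k-th nearest neighbor in the joint space.
    eps = np.sort(d_joint, axis=1)[:, k - 1]
    # Count marginal neighbors strictly closer than eps.
    n_s = (d_s < eps[:, None]).sum(axis=1)
    n_a = (d_a < eps[:, None]).sum(axis=1)
    return (digamma_int(k) + digamma_int(n)
            - digamma_int(n_s + 1) - digamma_int(n_a + 1))


def score_trajectories(per_sample, traj_ids):
    """Average the per-sample contributions within each trajectory."""
    return {int(t): float(per_sample[traj_ids == t].mean())
            for t in np.unique(traj_ids)}


def filter_top_fraction(scores, keep_frac=0.5):
    """Return the ids of the highest-scoring trajectories."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(round(keep_frac * len(ranked))))
    return ranked[:n_keep]


# Toy data (an assumption for illustration): trajectories 0 and 1 have actions
# that closely track the state; trajectories 2 and 3 have unrelated actions.
rng = np.random.default_rng(0)
T = 50
traj_ids = np.repeat(np.arange(4), T)
z_s = rng.uniform(-1, 1, size=(4 * T, 2))
z_a = rng.uniform(-1, 1, size=(4 * T, 2))
good = traj_ids < 2
z_a[good] = z_s[good] + 0.01 * rng.normal(size=(2 * T, 2))

scores = score_trajectories(ksg_per_sample(z_s, z_a, k=3), traj_ids)
kept = filter_top_fraction(scores, keep_frac=0.5)
print(scores, kept)
```

The brute-force distance matrices cost $O(N^2)$ memory, which is acceptable for a toy example; a tree-based neighbor search would be the natural substitute for larger datasets.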