Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets

Sreevardhan Sirigiri, Nathan Samuel de Lara, Christopher Agia, Florian Shkurti, Fabio Ramos

Abstract

Robotics datasets for imitation learning typically consist of long-horizon trajectories of different lengths over states, actions, and high-dimensional observations (e.g., RGB video), making it non-trivial to quantify diversity in a way that respects the underlying trajectory structure and geometry. We extend Shannon and von Neumann entropy to this setting by defining signature transform-based entropy on the Gram matrix of a signature kernel over demonstrations, yielding entropy and diversity metrics that operate directly on the demonstration dataset. Building on these metrics, we study how dataset diversity affects generalization performance in robot imitation learning and propose a simple, model-free way to curate diverse demonstrations. We introduce FAKTUAL (FAst trajectory Kernel enTropy cUration for imitation Learning), a data curation algorithm that selects a subset of demonstrations maximizing entropy given a subset-size budget. FAKTUAL is fully model-free, requires no access to the imitation policy or rollouts, and adds negligible overhead relative to policy training. We evaluate our approach on image and state-based RoboMimic and MetaWorld benchmarks, as well as four real-world manipulation tasks. Across tasks and architectures, diversity-aware curation with FAKTUAL consistently improves downstream success rates over random selection, while being substantially more computationally efficient compared to recent robot data curation methods. Our results suggest that the entropy of demonstration datasets is a practical tool for understanding and improving dataset diversity in robot imitation learning.
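As a rough illustration of the idea in the abstract, the sketch below computes a von Neumann-style entropy from the Gram matrix of a kernel over variable-length trajectories. It is not the authors' implementation: the depth-2 truncated signature, the RBF kernel on those features, the trace normalization, and all function names are assumptions made for illustration only.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 truncated signature of a piecewise-linear path of shape (T, d)."""
    inc = np.diff(path, axis=0)                # (T-1, d) increments
    level1 = inc.sum(axis=0)                   # total displacement
    prefix = np.cumsum(inc, axis=0) - inc      # sum of increments before each step
    # Level 2: sum_{i<j} inc_i (x) inc_j + 0.5 * sum_i inc_i (x) inc_i
    level2 = prefix.T @ inc + 0.5 * inc.T @ inc
    return np.concatenate([level1, level2.ravel()])

def rbf_gram(features, bandwidth=1.0):
    """RBF Gram matrix on feature vectors (assumed stand-in for a signature kernel)."""
    sq = np.sum(features ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * features @ features.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def von_neumann_entropy(gram):
    """Entropy of the eigenvalues of the trace-normalized Gram matrix."""
    rho = gram / np.trace(gram)
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]                     # drop numerically zero eigenvalues
    return float(-np.sum(eig * np.log(eig)))

# Toy dataset: 2-D random-walk "demonstrations" of different lengths.
rng = np.random.default_rng(0)
demos = [np.cumsum(rng.normal(scale=0.1, size=(int(T), 2)), axis=0)
         for T in rng.integers(20, 60, size=8)]
features = np.stack([signature_depth2(p) for p in demos])
print("entropy:", von_neumann_entropy(rbf_gram(features, bandwidth=1.0)))
```

Under this normalization, a dataset of identical trajectories has entropy close to 0, while $n$ mutually orthogonal trajectories reach the maximum $\log n$. Because the signature has a fixed size regardless of path length, trajectories of different lengths can share one Gram matrix.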

Paper Structure (75 sections, 6 theorems, 72 equations, 13 figures, 5 tables, 3 algorithms)

Key Result

Theorem 1

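Consider the same setting as Definition sigT_formal. Then, for any $a\le s\le u\le t\le b$, $\varphi(x)_{s,t} = \varphi(x)_{s,u}\otimes\varphi(x)_{u,t}$. Alternatively, for each level $k\ge 0$, $\varphi(x)_{s,t}^{k} = \sum_{\ell=0}^{k}\varphi(x)_{s,u}^{\ell}\otimes\varphi(x)_{u,t}^{k-\ell}$, where $\varphi(x)_{a,b}^{\ell}\in V^{\otimes \ell}$ is the $\ell$-th iterated integral.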
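Chen's identity can be checked numerically at low truncation depth. The following is a minimal sketch, assuming piecewise-linear paths and a depth-2 truncation; the helper names `sig2` and `chen_concat` are hypothetical. At depth 2, the identity says that level 1 adds across the split point, and level 2 of the whole path is the sum of the two level-2 terms plus the outer product of the two level-1 terms.

```python
import numpy as np

def sig2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path of shape (T, d)."""
    inc = np.diff(path, axis=0)
    prefix = np.cumsum(inc, axis=0) - inc      # sum of increments before each step
    level1 = inc.sum(axis=0)
    level2 = prefix.T @ inc + 0.5 * inc.T @ inc
    return level1, level2

def chen_concat(sig_a, sig_b):
    """Combine depth-2 signatures of two consecutive path segments."""
    a1, a2 = sig_a
    b1, b2 = sig_b
    return a1 + b1, a2 + np.outer(a1, b1) + b2

rng = np.random.default_rng(0)
path = rng.normal(size=(12, 3))
cut = 5                                        # index of the shared split point u
left, right = path[:cut + 1], path[cut:]       # both segments include the split point

whole = sig2(path)
glued = chen_concat(sig2(left), sig2(right))
print(np.allclose(whole[0], glued[0]), np.allclose(whole[1], glued[1]))
```

This property allows the signature of a long trajectory to be assembled from the signatures of its pieces rather than recomputed from scratch.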

Figures (13)

  • Figure 1: A graphical depiction of FAKTUAL. First, for each demonstration in the dataset, we embed the observations (specifically, the RGB images) into a ViT feature space, or alternatively extract low-dimensional representations of the objects of interest (e.g., point clouds) from the images. We then flatten any additional modalities present in the demonstrations, such as states and actions, if required, so that each demonstration can be represented as a path, or trajectory, in space (detailed steps can be found in Appendix \ref{App:demo_to_paths}). Using these path representations, we compute signature-based entropy and diversity metrics, which we then use to select a subset of the dataset that maximizes the entropy or diversity.
  • Figure 2: Entropy vs. Number of Demonstrations. In the Can task, the entropy saturates quickly and approaches an asymptote. In contrast, the Transport task entropy does not saturate over the range shown, indicating that the Transport task is more diverse (at least as measured by the signature entropy) and that each demonstration contributes almost equally to the entropy; this explains the weaker separation between the random and FAKTUAL lines. Note that the exact value of the entropy varies greatly with the chosen kernel bandwidth; for brevity, we show only a bandwidth of $1.0$ in this figure.
  • Figure 3: RoboMimic curation results. Success rates are computed over 50 rollouts for each of the 20 checkpoints, and we report the maximum across checkpoints. Results are averaged over three random seeds, and the error bars indicate the minimum and maximum values. On Can, FAKTUAL consistently outperforms random selection even at the lowest demo counts, and the gap is particularly large for PH, where performance improves sharply with very few demonstrations. For Square, the curves are largely indistinguishable at low demo counts; at higher counts, the MH intervals still overlap slightly, while PH shows a small but consistent separation in favor of FAKTUAL. This PH advantage is plausibly because PH comprises proficient-human, high-quality demonstrations, and FAKTUAL operates as a diversity-based curation strategy rather than a data-quality-based one (this is in line with the idea of uniform coverage/density discussed in Section \ref{sec:intro}, when the data quality is high). Transport is more challenging: at small budgets FAKTUAL is comparable to or marginally below random selection, but trends toward an advantage as the demonstration count increases. It is possible that FAKTUAL will eventually outperform random selection as more demonstrations are added and the entropy begins to saturate (see Figure \ref{fig:entropy_vs_num_demos}).
  • Figure 4: Signature entropy correlates with success. Success rate versus signature entropy across RoboMimic tasks (Can, Square, Transport) and dataset variants (MH/PH). Points are different curated subsets colored by the number of demonstrations; dotted lines show linear fits for random selection (green circles) and FAKTUAL (magenta squares). Pearson correlation coefficients ($r$) are reported in each panel, indicating a strong positive association between entropy and downstream success.
  • Figure 5: Metaworld curation results. Success rates are computed over 50 rollouts for each of the 20 checkpoints, and we report the maximum across checkpoints. Metaworld reports substantially different success rates across random seeds; accordingly, we plot only the mean success rate across 3 seeds here, and provide the full per-seed results in Table \ref{tab:full_results_metaworld}.
  • ...and 8 more figures
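
The final step described in the Figure 1 caption, selecting a subset that maximizes entropy under a size budget, could be approximated greedily. The sketch below is an assumption for illustration, not the FAKTUAL algorithm itself: it takes any precomputed positive semi-definite Gram matrix, starts from an arbitrary first demonstration (index 0), and repeatedly adds the demonstration that most increases the entropy of the selected sub-matrix. The function names are hypothetical.

```python
import numpy as np

def von_neumann_entropy(gram):
    """Entropy of the eigenvalues of the trace-normalized Gram matrix."""
    rho = gram / np.trace(gram)
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]                     # drop numerically zero eigenvalues
    return float(-np.sum(eig * np.log(eig)))

def greedy_entropy_subset(gram, budget):
    """Greedily pick `budget` indices whose sub-Gram matrix has high entropy."""
    n = gram.shape[0]
    selected = [0]                             # a single item always has entropy 0
    while len(selected) < min(budget, n):
        best, best_h = None, -np.inf
        for c in range(n):
            if c in selected:
                continue
            idx = selected + [c]
            h = von_neumann_entropy(gram[np.ix_(idx, idx)])
            if h > best_h:                     # ties keep the lowest index
                best, best_h = c, h
        selected.append(best)
    return selected

# Toy Gram matrix: demos 0-2 are exact duplicates, demos 3 and 4 are distinct.
gram = np.eye(5)
gram[:3, :3] = 1.0
print(greedy_entropy_subset(gram, budget=3))
```

In this toy case the duplicates add no entropy, so the greedy pass skips them in favor of the two distinct demonstrations. Each step re-evaluates an eigendecomposition per candidate, which is acceptable for a sketch but would likely need incremental updates at larger dataset sizes.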

Theorems & Definitions (21)

  • Definition 1
  • Definition 2
  • Definition 3: Signature Shannon Entropy
  • Definition 4: Signature von Neumann Entropy
  • Definition 5: Signature-entropy–maximizing $m$-subset
  • Definition 6: Signature-determinant–maximizing $m$-subset
  • Definition 7
  • Theorem 1: Chen's identity
  • Definition 8
  • Theorem 2
  • ...and 11 more