Data Assessment for Embodied Intelligence
Jiahao Xiao, Bowen Yan, Jianbo Zhang, Jia Wang, Chunyi Li, Zhengxue Cheng, Guangtao Zhai
TL;DR
The paper addresses the quality of embodied datasets by proposing two data-driven tools: diversity entropy, computed from a unified multimodal representation to quantify information richness, and a training-free learnability estimator that predicts how easily information in the dataset can be learned. It introduces a 3-frame unified representation and employs a Parzen-window kernel entropy to measure dataset diversity, while a learnability framework combines intra-task factors (E_t and R_t) with inter-task transfer to compute L_{\text{dataset}}. Validations on 21 embodied datasets (≈800 GB) and both simulated and real-world data show strong correlations between predicted and ground-truth learnability and reveal that increasing data volume alone yields diminishing returns without improving per-sample diversity. The methods enable immediate dataset assessment and guide data collection and generation toward higher information conversion and more transferable learning, supporting faster advancement in embodied intelligence.
Abstract
In embodied intelligence, datasets play a pivotal role, serving as both a knowledge repository and a conduit for information transfer. The two most critical attributes of a dataset are the amount of information it provides and how easily this information can be learned by models. However, the multimodal nature of embodied data makes evaluating these properties particularly challenging. Prior work has largely focused on diversity, typically counting tasks and scenes or evaluating isolated modalities, which fails to provide a comprehensive picture of dataset diversity. On the other hand, the learnability of datasets has received little attention and is usually assessed post-hoc through model training, an expensive, time-consuming process that also lacks interpretability, offering little guidance on how to improve a dataset. In this work, we address both challenges by introducing two principled, data-driven tools. First, we construct a unified multimodal representation for each data sample and, based on it, propose diversity entropy, a continuous measure that characterizes the amount of information contained in a dataset. Second, we introduce the first interpretable, data-driven algorithm to efficiently quantify dataset learnability without training, enabling researchers to assess a dataset's learnability immediately upon its release. We validate our algorithm on both simulated and real-world embodied datasets, demonstrating that it yields faithful, actionable insights that enable researchers to jointly improve diversity and learnability. We hope this work provides a foundation for designing higher-quality datasets that advance the development of embodied intelligence.
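To make the idea of a Parzen-window kernel entropy concrete, the following is a minimal sketch and not the paper's implementation. It assumes each data sample has already been mapped to a fixed-length feature vector (the unified multimodal representation itself is not reproduced here), uses an isotropic Gaussian kernel, and computes a resubstitution estimate of Shannon entropy. The function name `parzen_entropy`, the `bandwidth` parameter, and the synthetic data are all hypothetical choices for illustration.

```python
import math
import random


def parzen_entropy(samples, bandwidth=1.0):
    """Resubstitution Shannon-entropy estimate with a Gaussian Parzen window.

    Assumption: `samples` is a list of equal-length feature vectors standing in
    for per-sample representations. The kernel and bandwidth are illustrative.
    """
    n = len(samples)
    d = len(samples[0])
    # Log of the Gaussian kernel's normalising constant in d dimensions.
    log_norm = -0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    total_log_p = 0.0
    for xi in samples:
        # Log kernel value between xi and every sample (including itself).
        log_k = [
            log_norm
            - sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2.0 * bandwidth ** 2)
            for xj in samples
        ]
        # Log-sum-exp for numerical stability, then average over the n kernels.
        m = max(log_k)
        log_p = m + math.log(sum(math.exp(v - m) for v in log_k)) - math.log(n)
        total_log_p += log_p
    # Entropy is the negative mean log-density over the samples.
    return -total_log_p / n


# Synthetic stand-ins: a tightly clustered set and a widely spread set.
rng = random.Random(0)
tight = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(100)]
spread = [[rng.gauss(0.0, 2.0) for _ in range(3)] for _ in range(100)]

h_tight = parzen_entropy(tight, bandwidth=0.5)
h_spread = parzen_entropy(spread, bandwidth=0.5)
print(f"tight: {h_tight:.3f}  spread: {h_spread:.3f}")
```

Under these assumptions, the widely spread set receives a higher entropy than the tightly clustered one, which matches the intuition that adding near-duplicate samples increases volume without increasing diversity. The result depends on the bandwidth, so comparisons are only meaningful when the same bandwidth and feature space are used across datasets.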
