Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Jinghuan Shang, Karl Schmeckpeper, Brandon B. May, Maria Vittoria Minniti, Tarik Kelestemur, David Watkins, Laura Herlant
TL;DR
Theia is a distillation framework that fuses multiple vision foundation models (VFMs) into a compact vision backbone for robot learning, yielding richer spatial representations for downstream policy learning. Feature translators are trained to align the small backbone’s spatial tokens with the features of diverse teacher models while preserving patch-level information; with this setup, Theia outperforms prior approaches while using far less training data and compute. Experiments on CortexBench and on real-world tasks show that Theia outperforms individual VFMs and other distillation baselines in both simulated and physical settings. Analyses further suggest that higher entropy in feature-norm distributions is associated with better robot-learning performance. The work offers practical insights into token-level distillation, distributed visual knowledge, and how representation quality, quantified via entropy, translates into improved robotic control.
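To make the idea of feature translators concrete, here is a minimal sketch of a token-level distillation objective. It is not the authors' implementation. It assumes each translator is a single linear map from the student token dimension to one teacher's feature dimension, and that alignment is scored by mean per-patch cosine distance averaged over teachers. The names `translate`, `cosine_distance`, `distillation_loss`, `teacher_a`, and `teacher_b` are hypothetical, and the sketch computes only the loss, with no gradient step.

```python
import math


def translate(token, weight):
    """Apply a linear translator; `weight` has shape (teacher_dim, student_dim)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; a zero vector counts as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def distillation_loss(student_tokens, teacher_tokens, translators):
    """Mean per-patch cosine distance, averaged over all teachers (assumed loss)."""
    per_teacher = []
    for name, weight in translators.items():
        targets = teacher_tokens[name]
        dists = [
            cosine_distance(translate(s, weight), t)
            for s, t in zip(student_tokens, targets)
        ]
        per_teacher.append(sum(dists) / len(dists))
    return sum(per_teacher) / len(per_teacher)


# Toy data: two spatial tokens of dimension 2, two hypothetical teachers.
student_tokens = [[1.0, 0.0], [0.0, 1.0]]
translators = {
    "teacher_a": [[1.0, 0.0], [0.0, 1.0]],              # 2 -> 2
    "teacher_b": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # 2 -> 3
}
aligned_teachers = {
    "teacher_a": [[2.0, 0.0], [0.0, 3.0]],
    "teacher_b": [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}
misaligned_teachers = {
    "teacher_a": [[0.0, 1.0], [1.0, 0.0]],  # orthogonal to translated tokens
    "teacher_b": [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
}

loss_aligned = distillation_loss(student_tokens, aligned_teachers, translators)
loss_misaligned = distillation_loss(student_tokens, misaligned_teachers, translators)
```

Because the loss is computed per spatial token rather than on a pooled vector, every patch contributes to it, which is one way to read the claim that patch-level information is preserved.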
Abstract
Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code, models, and demo are available at https://theia.theaiinstitute.com.
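As an illustration of what "entropy in feature norm distributions" could mean in practice, the sketch below computes a Shannon entropy over the L2 norms of patch tokens. The construction is an assumption, not the paper's stated definition: norms are normalized to sum to one across spatial positions and treated as a discrete distribution. The function name `feature_norm_entropy` and the toy inputs are hypothetical.

```python
import math


def feature_norm_entropy(tokens):
    """Shannon entropy (in nats) of patch-token L2 norms normalized to sum to 1."""
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    total = sum(norms)
    if total == 0.0:
        raise ValueError("all tokens have zero norm")
    probs = [n / total for n in norms]
    # Skip zero-probability entries, since p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy inputs: four patch tokens each (assumed data, for illustration only).
uniform_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
peaked_tokens = [[10.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]

entropy_uniform = feature_norm_entropy(uniform_tokens)
entropy_peaked = feature_norm_entropy(peaked_tokens)
```

Under this definition, a representation whose norm mass is spread evenly over patches reaches the maximum entropy (the natural log of the number of patches), while one dominated by a few high-norm tokens scores lower.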
