Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Jinghuan Shang; Karl Schmeckpeper; Brandon B. May; Maria Vittoria Minniti; Tarik Kelestemur; David Watkins; Laura Herlant

TL;DR

Theia introduces a distillation framework that fuses multiple vision foundation models into a compact robot vision backbone, yielding richer spatial representations for downstream policy learning. By training feature translators to align a small backbone’s spatial tokens with diverse teacher features and preserving patch-level information, Theia achieves superior performance with far less data and compute than prior approaches. Extensive CortexBench and real-world experiments show Theia outperforms individual VFMs and other distillation baselines across simulated and physical tasks, while analyses reveal a strong link between high entropy in feature-norm distributions and robot-learning effectiveness. The work provides practical insights into token-level distillation, distributed visual knowledge, and how representation quality, quantified via entropy, translates into improved robotic control.

Abstract

Vision-based robot policy learning, which maps visual inputs to actions, necessitates a holistic understanding of diverse visual tasks beyond single-task needs like classification or segmentation. Inspired by this, we introduce Theia, a vision foundation model for robot learning that distills multiple off-the-shelf vision foundation models trained on varied vision tasks. Theia's rich visual representations encode diverse visual knowledge, enhancing downstream robot learning. Extensive experiments demonstrate that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes. Additionally, we quantify the quality of pre-trained visual representations and hypothesize that higher entropy in feature norm distributions leads to improved robot learning performance. Code, models, and demo are available at https://theia.theaiinstitute.com.
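The entropy hypothesis in the abstract can be made concrete with a small sketch. The version below is an illustration under stated assumptions, not the authors' exact metric: it takes the L2 norm of each spatial token, normalizes the norms so they sum to one, and reports the Shannon entropy of that distribution. The function name `feature_norm_entropy` and the toy inputs are hypothetical.

```python
import math


def feature_norm_entropy(tokens):
    """Shannon entropy (in nats) of the normalized L2 norms of spatial tokens.

    Assumption: the "feature norm distribution" is formed by dividing each
    token's norm by the sum of all token norms. The excerpt does not spell
    out the exact definition.
    """
    norms = [math.sqrt(sum(v * v for v in tok)) for tok in tokens]
    total = sum(norms)
    if total == 0.0:
        raise ValueError("all token norms are zero")
    probs = [n / total for n in norms]
    # Zero-probability tokens contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy inputs: four 2-D tokens with equal norms vs. one dominant token.
uniform_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
peaked_tokens = [[10.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.1]]

uniform_entropy = feature_norm_entropy(uniform_tokens)
peaked_entropy = feature_norm_entropy(peaked_tokens)
```

Under this definition, tokens with equal norms give the maximum entropy, log(N) for N tokens, while a representation dominated by a few high-norm tokens scores lower.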

Paper Structure (58 sections, 2 equations, 16 figures, 18 tables)

Introduction
Related Work
Visual Representations for Robot Learning
Vision Foundation Models
Knowledge Distillation in Vision Models
Method
Overview.
Architecture.
Rich Spatial Representation
Feature Translators.
Training
Distillation Objective.
Feature Normalization.
Dataset.
Experiments
...and 43 more sections

Figures (16)

Figure 1: We introduce Theia, a model that distills multiple vision foundation models (VFMs) to provide better representations for robot learning (left). Theia achieves superior performance on robot learning tasks with less computation compared to standard VFMs and pre-trained models for robotics (right). Results shown are from the MuJoCo subset of tasks in CortexBench.
Figure 2: Theia distills multiple VFM features into one rich representation for robot learning. The feature translators $g_i(\mathbf{z})$ are supervised by the features from pretrained VFMs $h_i(\mathbf{x})$ during training time, then the distilled representation $\mathbf{z}$ is used as input to the policy head for robot learning tasks.
Figure 3: Simulation and real-world (labeled in blue) tasks used in this work. For simulated environments we show one image per task. For real-world tasks, we show images of key steps throughout the task labeled by numbers. A third-person view image shows the setup in Drawer Opening.
Figure 4: Performance on MuJoCo tasks vs. inference computation. Theia achieves the best performance with much less compute (MACs (G): Multiply-Accumulate operations in billions, log-scale).
Figure 5: MuJoCo subset performance with respect to different combinations of teacher models to train Theia-T. Abbreviations of teacher models: V=ViT-H, C=CLIP-L, S=SAM-H, Di=DINOv2-L, De=Depth-Anything-L, All=all of five models (CDeDiSV), and All$-$X=taking X out of All.
...and 11 more figures
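The training setup described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: each feature translator $g_i$ is assumed to be a single linear map, the per-teacher loss is a plain mean-squared error standing in for the actual distillation objective (which this excerpt does not give), and all names (`linear`, `mse`, `distillation_loss`, `teacher_a`, `teacher_b`) are hypothetical.

```python
def linear(tokens, weight):
    """Apply a linear map to every token; tokens is N x D, weight is D x D_out."""
    d_out = len(weight[0])
    return [
        [sum(tok[k] * weight[k][j] for k in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]


def mse(pred, target):
    """Mean-squared error over all elements of two equally shaped token lists."""
    count = sum(len(row) for row in pred)
    squared = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return squared / count


def distillation_loss(z, translators, teacher_features):
    """Sum of per-teacher losses between g_i(z) and the teacher features h_i(x).

    Assumption: MSE is a stand-in loss and each translator is one linear layer.
    """
    return sum(
        mse(linear(z, translators[name]), teacher_features[name])
        for name in teacher_features
    )


# Student representation z: two spatial tokens of width 2.
z = [[1.0, 2.0], [0.0, 1.0]]

# One hypothetical translator per teacher, mapping width 2 to the teacher width.
translators = {
    "teacher_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # 2 -> 3
    "teacher_b": [[2.0], [0.0]],                      # 2 -> 1
}

# Teacher features h_i(x), which would come from frozen pretrained VFMs.
teacher_features = {
    "teacher_a": [[1.0, 2.0, 0.0], [0.0, 1.0, 0.0]],  # matched exactly by teacher_a's translator
    "teacher_b": [[2.0], [1.0]],                      # second token is off by 1
}

loss = distillation_loss(z, translators, teacher_features)
```

In this sketch only the shared representation `z` would be passed to the policy head after training; the translators exist solely to supply the teacher supervision signal.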
