VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka

TL;DR

VER addresses the generalization gap in robot learning by distilling multiple vision foundation models into a Vision Expert Library (VEL) and training a lightweight Robot Router to dynamically select task-relevant experts. A two-stage process—distillation from VFMs into VEL followed by downstream policy training with a frozen expert bank and a trainable router—enables scalable adaptation to robot-specific domains. Patchwise Expert Routing with Curriculum Top-K Annealing (CTA) improves exploration and reduces premature convergence, while maintaining low computational overhead ($<0.4\%$ of parameters for routing). Across 17 robotic tasks and multiple policy heads, VER achieves state-of-the-art results, concentrates attention on task-critical regions, and demonstrates strong extensibility by integrating new robot-domain knowledge through additional experts.
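
The routing idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the linear annealing schedule, the softmax renormalization over the selected experts, and all function names (`annealed_k`, `route_patch`, `route_patches`) are assumptions made for the example. The source only states that K is annealed under a curriculum and that routing is applied per patch.

```python
import math


def annealed_k(step, total_steps, k_start, k_end):
    """Assumed schedule: linearly anneal K from k_start down to k_end."""
    if total_steps <= 1:
        return k_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_patch(logits, k):
    """Keep the k highest-scoring experts for one patch.

    Returns {expert_index: weight}, with weights renormalized over the
    selected experts (an assumption; other normalizations are possible).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def route_patches(patch_logits, k):
    """Apply routing independently to every patch (patchwise routing)."""
    return [route_patch(logits, k) for logits in patch_logits]


if __name__ == "__main__":
    # Hypothetical router logits: 2 patches, 4 experts.
    patch_logits = [[0.1, 2.0, -1.0, 1.5], [1.2, -0.3, 0.8, 0.0]]
    for step in range(5):
        k = annealed_k(step, total_steps=5, k_start=4, k_end=1)
        print(step, k, route_patches(patch_logits, k))
```

Starting with a large K lets every expert receive gradient early in training, and shrinking K later forces each patch to commit to a few experts, which is one way to read the claim that CTA improves exploration and reduces premature convergence.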

Abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy learning can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full re-training to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code are available at https://yixiaowang7.github.io/ver_page/.
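
To make the "fewer than 0.4% of parameters" claim concrete, the sketch below counts trainable versus frozen parameters when only a router is trained. Every size here (embedding width, hidden width, backbone parameter count, one linear router per MoE layer) is a hypothetical placeholder rather than a figure from the paper. The policy head, which is also trained downstream, is left out of the count.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    num_params: int
    trainable: bool


def linear_params(d_in, d_out):
    """Parameter count of a dense layer with bias."""
    return d_in * d_out + d_out


def trainable_fraction(groups):
    """Share of all parameters that receive gradient updates."""
    total = sum(g.num_params for g in groups)
    trainable = sum(g.num_params for g in groups if g.trainable)
    return trainable / total


# Hypothetical sizes, chosen only for illustration.
EMBED_DIM = 768
HIDDEN_DIM = 3072
NUM_EXPERTS = 6
NUM_MOE_LAYERS = 3
BACKBONE_PARAMS = 86_000_000

# Assume each expert is a two-layer MLP and each router is one linear layer.
expert_params = linear_params(EMBED_DIM, HIDDEN_DIM) + linear_params(HIDDEN_DIM, EMBED_DIM)
router_params = linear_params(EMBED_DIM, NUM_EXPERTS)

groups = [
    ParamGroup("base_vision_transformer", BACKBONE_PARAMS, trainable=False),
    ParamGroup("vision_expert_library", NUM_MOE_LAYERS * NUM_EXPERTS * expert_params, trainable=False),
    ParamGroup("robot_router", NUM_MOE_LAYERS * router_params, trainable=True),
]

fraction = trainable_fraction(groups)

if __name__ == "__main__":
    print(f"trainable fraction: {fraction:.6%}")
```

Under these placeholder sizes the router is a tiny share of the model, which is why freezing the expert library keeps downstream adaptation cheap. The actual ratio depends on the real architecture.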

Paper Structure

This paper contains 35 sections, 7 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: A comparison between our VER and previous distillation frameworks. Our method not only enhances knowledge distillation from vision foundation models (VFMs) into vision experts but also offers two key advantages over previous works [radio, shang2024theia]. First, VER trains a lightweight router that dynamically selects vision experts for downstream robot policies. Second, VER allows the integration of additional trainable experts, enabling adaptation to robot-specific domain knowledge to further improve robotic performance.
  • Figure 2: Overall structure of VER. VER comprises two key components: the Base Vision Transformer (BVT), which processes images into unified representations, and the Vision Expert Library (VEL), which stores a diverse set of specialized vision experts and selectively uses them to mimic teacher vision foundation models and enhance performance in downstream robotic tasks. Our framework consists of two phases: (1) Pretraining, where we distill multiple foundation models (DINOv2 [oquab2024dinov2], ViT [caron2021emerging], CLIP [radford2021learning]) into VER; (2) Downstream Robotic Tasks, where we freeze the experts and train a lightweight Robot Router ($<0.4\%$ of parameters) that dynamically selects task-relevant visual features to guide the policy head in generating appropriate robotic actions. This two-stage approach enables efficient knowledge distillation from diverse vision foundation models and adaptive feature selection for robotic tasks.
  • Figure 3: Cosine loss for DINOv2 distillation. Circle size indicates total parameters (TP).
  • Figure 4: Expert utilization frequency across three MoE layers. Heatmap shows how each teacher model activates experts (1–6) during distillation on ImageNet-1K.
  • Figure 5: Visualization of real-world experiments. Even under human interference, which does not appear in the training dataset, VER successfully completes the task.
  • ...and 10 more figures
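
The quantities named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustrative reading only: the caption does not give the exact loss, so the cosine loss is assumed here to be the mean of one minus cosine similarity over patch features, and expert utilization is assumed to be each expert's share of all routing selections. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_loss(student, teacher, eps=1e-12):
    """Assumed form: mean over patches of (1 - cosine similarity)."""
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm_s * norm_t + eps)
    return total / len(student)


def expert_utilization(selected_experts, num_experts):
    """Fraction of all routing selections that went to each expert.

    selected_experts holds one list of chosen expert indices per patch.
    """
    counts = Counter(e for patch in selected_experts for e in patch)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]


if __name__ == "__main__":
    # Toy features: two patches with 2-D features (hypothetical values).
    student = [[1.0, 0.0], [0.5, 0.5]]
    teacher = [[2.0, 0.0], [0.0, 1.0]]
    print(cosine_loss(student, teacher))

    # Toy routing decisions: each patch picked two of three experts.
    print(expert_utilization([[0, 1], [0, 2]], num_experts=3))
```

A cosine-based loss ignores feature magnitude and penalizes only directional mismatch between student and teacher features. Tabulating `expert_utilization` per teacher and per layer would give a grid like the heatmap described for Figure 4.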