VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
TL;DR
VER addresses the generalization gap in robot learning by distilling multiple vision foundation models (VFMs) into a Vision Expert Library (VEL) and training a lightweight Robot Router to dynamically select task-relevant experts. Training proceeds in two stages: VFMs are first distilled into the VEL, and a downstream policy is then trained with the expert bank frozen and only the router trainable, which enables scalable adaptation to robot-specific domains. Patchwise Expert Routing with Curriculum Top-K Annealing (CTA) improves exploration and reduces premature convergence while keeping computational overhead low (routing accounts for $<0.4\%$ of parameters). Across 17 robotic tasks and multiple policy heads, VER achieves state-of-the-art results, concentrates attention on task-critical regions, and demonstrates strong extensibility by integrating new robot-domain knowledge through additional experts.
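To make the idea of Curriculum Top-K Annealing concrete, the following is a minimal sketch of one possible schedule. The text above does not specify the schedule shape, so the linear decay, the function name `annealed_top_k`, and the parameters `k_start` and `k_end` are all assumptions made for illustration: routing starts with many active experts to encourage exploration and narrows to a few as training progresses.

```python
def annealed_top_k(step, total_steps, k_start, k_end):
    """Return the number of active experts at a given training step.

    Assumption: a linear decay from k_start to k_end. The actual CTA
    schedule may differ; this only illustrates "start wide, end narrow".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if k_end < 1 or k_end > k_start:
        raise ValueError("require 1 <= k_end <= k_start")
    # Clamp training progress to [0, 1] so steps past the end keep k_end.
    progress = min(max(step / total_steps, 0.0), 1.0)
    k = k_start - progress * (k_start - k_end)
    # Round to an integer expert count and never drop below k_end.
    return max(k_end, round(k))


if __name__ == "__main__":
    # Hypothetical setting: 6 experts early in training, 2 at the end.
    for step in (0, 50, 100):
        print(step, annealed_top_k(step, total_steps=100, k_start=6, k_end=2))
```

A wide early `k` lets the router receive gradient signal from many experts before it commits, which is one plausible reading of how annealing could reduce premature convergence.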
Abstract
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation for policy learning can mitigate this limitation, but it often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and the precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code are available at https://yixiaowang7.github.io/ver_page/.
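As an illustration of what patchwise expert routing could look like, the sketch below selects the top-k experts independently for each image patch and mixes their outputs with softmax weights computed over the selected experts only. This is a generic top-k mixture-of-experts pattern written in plain Python under stated assumptions, not the authors' implementation: the names `route_patch` and `route_patches`, the per-patch router logits, and the renormalized softmax are all hypothetical choices.

```python
import math


def route_patch(router_logits, expert_outputs, k):
    """Mix the top-k experts for a single patch.

    router_logits: one score per expert (assumed output of a small router).
    expert_outputs: one feature vector per expert for this patch.
    Returns the mixed feature vector and the weights of the chosen experts.
    """
    if len(router_logits) != len(expert_outputs):
        raise ValueError("need one logit per expert output")
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k highest-scoring experts for this patch.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only (max subtracted for stability).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    # Weighted sum of the selected experts' features.
    dim = len(expert_outputs[0])
    mixed = [sum(weights[i] * expert_outputs[i][d] for i in top)
             for d in range(dim)]
    return mixed, weights


def route_patches(all_logits, all_outputs, k):
    """Apply route_patch independently to every patch in an image."""
    return [route_patch(logits, outputs, k)[0]
            for logits, outputs in zip(all_logits, all_outputs)]


if __name__ == "__main__":
    # Two patches, three hypothetical experts, 2-dimensional features.
    logits = [[0.1, 2.0, -1.0], [0.0, 0.0, -5.0]]
    outputs = [[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
               [[2.0, 0.0], [0.0, 2.0], [9.0, 9.0]]]
    print(route_patches(logits, outputs, k=1))
```

Because each patch is routed on its own, different regions of one image can draw on different experts, which is consistent with the abstract's claim that the model concentrates on task-critical regions. In this sketch only the router producing the logits would be trained, while the expert outputs come from a frozen library.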
