Enhancing Vision-Based Policies with Omni-View and Cross-Modality Knowledge Distillation for Mobile Robots

Kai Li; Shiyu Zhao

Abstract

Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert actions but also to align its latent embeddings with those of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve scene transfer and navigation performance, and that the proposed distillation method improves the performance of a single-view monocular policy compared with policies that imitate actions alone. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.
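The two-part student objective described in the abstract (imitate the expert action, and align with the teacher embedding) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the outline mentions contrastive learning and Figure 2 mentions a linear projection in the teacher for dimension matching, but the InfoNCE form, the mean-squared-error action loss, the weight `lam`, the `temperature`, and all function names below are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / (norm + eps) for x in v]


def project(W, z):
    """Linear projection W @ z (assumed role: match teacher and student embedding sizes)."""
    return [dot(row, z) for row in W]


def action_mse(pred_actions, expert_actions):
    """Mean squared error between predicted and expert actions over a batch."""
    total, count = 0.0, 0
    for pred, expert in zip(pred_actions, expert_actions):
        for p, e in zip(pred, expert):
            total += (p - e) ** 2
            count += 1
    return total / count


def info_nce(student_z, teacher_z, temperature=0.1):
    """Contrastive alignment (assumed InfoNCE form).

    For sample i, the teacher embedding of the same sample is the positive;
    teacher embeddings of the other samples in the batch are negatives.
    """
    s = [l2_normalize(z) for z in student_z]
    t = [l2_normalize(z) for z in teacher_z]
    loss = 0.0
    for i, s_i in enumerate(s):
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(s)


def student_loss(pred_actions, expert_actions, student_z, teacher_z, W,
                 lam=1.0, temperature=0.1):
    """Action imitation plus embedding alignment; lam is a hypothetical weight."""
    projected_teacher = [project(W, z) for z in teacher_z]
    return (action_mse(pred_actions, expert_actions)
            + lam * info_nce(student_z, projected_teacher, temperature))
```

With `lam = 0` the objective reduces to plain action imitation, which corresponds to the action-only baseline the abstract compares against.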

Paper Structure (15 sections, 7 equations, 8 figures, 3 tables)

Introduction
Related Works
Proposed Method
Problem Formulation
Overview
Teacher Policy Training
Student Policy Training
Knowledge Distillation via Contrastive Learning
Expert Policy for Data Generation
Experimental Evaluations
Task Description and Implementation Details
Scene Transfer Performance
Mobile Robot Task Performance
Comparisons With Other Methods
Conclusions

Figures (8)

Figure 1: The teacher policy (left) leverages appearance-invariant omnidirectional depth images generated by concatenating multi-view inputs, while the student (right) relies on a single-view RGB image. Both the teacher and the student imitate the action $\mathbf{a}$ from the expert data, with the student additionally distilling the feature embedding $\mathbf{z}$ from the teacher. Compared to the teacher, the student policy is computationally lighter and better suited for deployment on lightweight, low-cost mobile robots.
Figure 2: Overview of the proposed method. (a) shows the knowledge transfer flow from the state-based expert to the omnidirectional-depth-based teacher and then to the RGB-based student. (b) shows the detailed pipeline of our method. In contrast to vanilla imitation learning (IL), our method imitates both the action output and the intermediate visual embeddings, as highlighted by the dashed-line boxes. LP in the teacher denotes a linear projection, which is used for embedding dimension matching.
Figure 3: Omnidirectional RGB and depth images from DATv2. The top row shows the simulation and the bottom row shows the real world. Omni-view is formed by concatenating multi-camera images. Camera 1 is used for the single-view student policy.
Figure 4: Embedding similarity comparison of different image encoders across scenes. Higher similarity values indicate more consistent embeddings and stronger scene transferability. $\mu$ is the mean value of similarity.
Figure 5: t-SNE [maaten2008visualizing] visualization of visual embeddings of different scenes from DINOv2 [oquab2024dinov2], Zhang et al. [zhang2025learning], and ours. Different colors represent different scenes.
...and 3 more figures