LMD-PGN: Cross-Modal Knowledge Distillation from First-Person-View Images to Third-Person-View BEV Maps for Universal Point Goal Navigation

Riku Uemura; Kanji Tanaka; Kenta Tsukahara; Daiki Iwata

TL;DR

This work tackles cross-platform point goal navigation (PGN) by introducing a cross-robot knowledge distillation framework that transfers FPV-based navigation knowledge to TPV local-map representations. The method redefines state using locally reconstructed maps via SLAM/SfM and maps actions to grid-based subgoals, enabling universal transfer to unknown or black-box robot platforms. A sampling-efficient KD approach is introduced through a local map descriptor and a rotation-invariant BEV coordinate system, with KD performed from a non-differentiable NNQL teacher to a differentiable MLP student using reciprocal rank vectors and KL divergence. Experiments in Habitat-Sim with 2D wheeled robots demonstrate the framework’s feasibility, with the distilled student achieving comparable or superior performance to the teacher in several settings, and the approach showing promise for extension to 3D platforms such as drones.
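The distillation step described above (reciprocal rank vectors from the teacher, KL divergence against the student) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sum-to-one normalization of the reciprocal ranks, the softmax over student logits, and the KL direction (teacher relative to student) are all assumptions made for illustration.

```python
import math

def reciprocal_rank_vector(teacher_scores):
    """Turn per-subgoal teacher scores (higher is better) into 1/rank values.

    Only the ordering of the scores is used, so the teacher itself
    does not need to be differentiable.
    """
    order = sorted(range(len(teacher_scores)), key=lambda i: -teacher_scores[i])
    rr = [0.0] * len(teacher_scores)
    for rank, idx in enumerate(order, start=1):
        rr[idx] = 1.0 / rank
    return rr

def softmax(logits):
    """Numerically stable softmax over the student's output logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_scores, student_logits):
    """KL(p || q), where p is the normalized reciprocal rank vector (assumed
    normalization) and q is the student's softmax distribution."""
    rr = reciprocal_rank_vector(teacher_scores)
    total = sum(rr)
    p = [r / total for r in rr]
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical example: three candidate subgoals scored by the teacher.
teacher_scores = [0.2, 0.9, 0.5]
uniform_student = [0.0, 0.0, 0.0]
loss = kd_kl_loss(teacher_scores, uniform_student)
```

Because the target depends only on the teacher's ranking, the loss is zero whenever the student's distribution reproduces the normalized reciprocal ranks, and positive otherwise. In practice the student logits would come from the MLP and the loss would be minimized with a gradient-based optimizer.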

Abstract

Point goal navigation (PGN) is a mapless navigation approach that trains robots to visually navigate to goal points without relying on pre-built maps. Despite significant progress in handling complex environments using deep reinforcement learning, current PGN methods are designed for single-robot systems, limiting their generalizability to multi-robot scenarios with diverse platforms. This paper addresses this limitation by proposing a knowledge transfer framework for PGN, allowing a teacher robot to transfer its learned navigation model to student robots, including those with unknown or black-box platforms. We introduce a novel knowledge distillation (KD) framework that transfers first-person-view (FPV) representations (view images, turning/forward actions) to universally applicable third-person-view (TPV) representations (local maps, subgoals). The state is redefined as a local map reconstructed using SLAM, while actions are mapped to subgoals on a predefined grid. To enhance training efficiency, we propose a sampling-efficient KD approach that aligns training episodes via a noise-robust local map descriptor (LMD). Although validated on 2D wheeled robots, this method can be extended to 3D action spaces, such as those of drones. Experiments conducted in Habitat-Sim demonstrate the feasibility of the proposed framework, which requires minimal implementation effort. This study highlights the potential for scalable and cross-platform PGN solutions, expanding the applicability of embodied AI systems in multi-robot scenarios.
