Online Robot Navigation and Manipulation with Distilled Vision-Language Models

Kangcheng Liu

TL;DR

The paper tackles autonomous navigation in dense, dynamic environments with unknown objects by coupling open-vocabulary perception with efficient edge-ready deployment. It introduces a regional vision-language framework that fuses visual and linguistic cues via a bi-directional transformer and regional matching, enabling zero-shot recognition beyond closed-set categories. To meet real-time constraints on embedded hardware, the authors propose a distillation pipeline and a structural trimming strategy, achieving lightweight yet effective perception for navigation when integrated with LiDAR-Inertial SLAM and motion planning. Extensive benchmarks and real-world robot experiments demonstrate improved open-world recognition accuracy and substantial gains in inference speed, validating the practical viability of deploying vision-language models in mobile robotics.
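The regional matching step can be pictured as comparing each region proposal's embedding with the embeddings of free-form text prompts and keeping the best match. The following is a minimal sketch of that idea only, not the authors' implementation: the function names, the `min_score` threshold, and the toy 3-D embeddings are all hypothetical, and a real system would obtain the embeddings from the visual and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def match_regions_to_prompts(region_embeddings, prompt_embeddings, prompts, min_score=0.0):
    """Assign each region the text prompt with the highest cosine similarity.

    Returns one (prompt, score) pair per region; prompt is None when the
    best score falls below min_score (a hypothetical rejection threshold).
    """
    text = [l2_normalize(t) for t in prompt_embeddings]
    results = []
    for region in region_embeddings:
        r = l2_normalize(region)
        # Cosine similarity reduces to a dot product after normalization.
        scores = [sum(a * b for a, b in zip(r, t)) for t in text]
        best = max(range(len(scores)), key=scores.__getitem__)
        label = prompts[best] if scores[best] >= min_score else None
        results.append((label, scores[best]))
    return results

# Toy embeddings standing in for real encoder outputs (hypothetical values).
prompts = ["walkway", "person", "trolley"]
prompt_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]

matches = match_regions_to_prompts(region_embeddings, prompt_embeddings, prompts)
labels = [label for label, _ in matches]
```

Because the prompt list is ordinary text, new categories can be added at inference time without retraining, which is what makes the recognition open-vocabulary.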

Abstract

Autonomous robot navigation in dynamic, unknown environments is crucial for mobile robotic applications such as last-mile delivery and automated supply delivery in industrial and hospital settings. Current solutions still have limitations: the robot cannot recognize unknown objects in real time, and it cannot navigate freely in dynamic, narrow, and complex environments. We propose a complete software framework for autonomous robot perception and navigation among very dense obstacles and dense human crowds. First, we propose a framework that accurately detects and segments open-world object categories in a zero-shot manner, overcoming the over-segmentation limitation of the current SAM model. Second, we propose a distillation strategy that transfers this knowledge to segment the free space of the walkway for robot navigation without labels. In addition, we design a trimming strategy that works together with distillation to enable lightweight inference, so that the neural network can be deployed on edge devices such as the NVIDIA TX2 or Xavier NX during autonomous navigation. Integrated into the robot navigation system, the proposed framework achieves superior accuracy and efficiency in robot scene perception and autonomous navigation in extensive experiments.
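The abstract does not state the distillation objective, so the sketch below assumes the generic soft-label formulation: a lightweight student is trained to match the temperature-softened per-pixel predictions of the larger teacher, which act as pseudo-labels so that no manual free-space annotation is needed. The function names, the temperature value, and the two-class logits are hypothetical illustrations, not the authors' method.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one pixel's class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over pixels, scaled by temperature squared."""
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p = softmax(t_logits, temperature)  # teacher pseudo-label distribution
        q = softmax(s_logits, temperature)  # student prediction
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * total / len(teacher_logits)

# Each row is one pixel with [obstacle, free-space] logits (hypothetical values).
teacher = [[0.2, 3.1], [2.5, -1.0], [0.0, 0.4]]
student_far = [[1.0, 1.0], [0.0, 0.0], [0.5, 0.0]]
student_close = [[0.3, 3.0], [2.4, -0.9], [0.1, 0.4]]

loss_far = distillation_loss(teacher, student_far)
loss_close = distillation_loss(teacher, student_close)
loss_same = distillation_loss(teacher, teacher)
```

The loss shrinks as the student's free-space predictions approach the teacher's and reaches zero when they coincide, so minimizing it requires only unlabeled images passed through both networks.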

Paper Structure

This paper contains 14 sections, 10 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Teaser: Autonomous navigation experiments in real-world situations. Our proposed approach provides accurate segmentation of the free space of the road while maintaining real-time efficiency.
  • Figure 2: We distill the knowledge into a lightweight model that runs on the robot's onboard computer to help the robot navigate in real-world circumstances. Robot grasping driven by open-vocabulary language prompts can also be realized. Segmentation comparisons in open-world scenarios: compared with the prevailing SAM model, our proposed approach captures more holistic object semantic information.
  • Figure 3: Open-vocabulary detection results in complicated real-world scenes. The previous vision-language model CLIP [radford2021learning] handles only image classification and cannot tackle the detection and segmentation required in robotic applications, while the SAM model [kirillov2023segment] focuses too much on fine-grained details and suffers from over-segmentation. Our proposed approach captures object-level information through region proposals and uses regional contrastive representation learning to achieve precise vision-language association at the regional level. Moreover, we design a modality interaction network that explores relations between the visual and linguistic modalities and boosts the fusion of their features. In our experiments on both public benchmarks and real-world settings, these designs demonstrate superior open-vocabulary recognition accuracy and lead to successful autonomous robot navigation in complex real-world scenarios.
  • Figure 4: The detailed structure of the proposed modality interaction Transformer network. The network is simple but effective at capturing and modeling the rich cross-modality feature relations and interactions between the visual and linguistic modalities.
  • Figure 5: Our final integrated system framework achieves autonomous robot navigation in real-world environments. We first propose an open-vocabulary recognition approach that recognizes unseen novel categories. Next, we distill the knowledge from the open-vocabulary model for free-space recognition of the road and propose network trimming approaches to achieve real-time performance on the robot's onboard computer. Integrated with the system framework depicted above, which extends our previous work [liu2023dlc], we perform autonomous language-guided navigation.
  • ...and 5 more figures