Direct Contact-Tolerant Motion Planning With Vision Language Models

He Li; Jian Sun; Chengyang Li; Guoliang Li; Qiyu Ruan; Shuai Wang; Chengzhong Xu

TL;DR

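DCT is a direct contact-tolerant planner that integrates vision-language models (VLMs) into direct point perception and navigation for contact-tolerant motion planning (CTMP). It has two key components: a VLM point cloud partitioner (VPP), which generates a contact-aware point cloud, and VPP guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints.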

Abstract

Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt maps, obstacle sets), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point perception and navigation and has two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints; this problem is then solved by a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
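To make the idea of a contact-aware point cloud concrete, the following is a minimal sketch of how an image-space mask could be used to partition scan points. It is not the authors' implementation: the function name `partition_points`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the assumption that points are already expressed in the camera frame, and the boolean mask layout are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# (x, y, z) in the camera frame, with z pointing forward (an assumption of this sketch).
Point = Tuple[float, float, float]


def partition_points(
    points: Sequence[Point],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[List[Point], List[Point]]:
    """Split points into (contact_tolerant, rigid) using an image-space mask.

    mask[v][u] is True where a pixel has been labeled as belonging to a
    movable object. Points behind the camera or outside the image are
    conservatively treated as rigid.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    tolerant: List[Point] = []
    rigid: List[Point] = []
    for x, y, z in points:
        if z <= 0.0:
            # Behind the camera: no pixel to look up, keep as a hard obstacle.
            rigid.append((x, y, z))
            continue
        # Pinhole projection to integer pixel coordinates.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height and mask[v][u]:
            tolerant.append((x, y, z))
        else:
            rigid.append((x, y, z))
    return tolerant, rigid
```

Treating unobserved points as rigid is a deliberately conservative default in this sketch: a planner consuming the two sets would then only allow contact where a mask explicitly permits it.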

Paper Structure (26 sections, 14 equations, 12 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Problem Statement
Task Description
Contact-Tolerant Collision Avoidance
Direct Contact-Tolerant Motion Planning
System Overview
Movable Obstacle Grounding with VPP
VLM-Driven Obstacle Filter
Memory-Driven Mask Generation
Phase I: Viewpoint warping
Phase II: Detection-driven reconciliation
Phase III: Point-level refinement and publishing
Fast Motion Planning with VGN
Learned Distance
...and 11 more sections

Figures (12)

Figure 1: DCT motion planning with VLM.
Figure 2: System architecture of DCT, which consists of VPP and VGN modules.
Figure 3: Memory-driven mask generation.
Figure 4: Deep unfolded neural network.
Figure 5: Simulation setup.
...and 7 more figures
