CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation

Aditya Potnis; Francisco Affonso; Shreya Gummadi; Naveen Kumar Uppalapati; Girish Chowdhary

Abstract

Navigating unstructured environments requires assessing traversal risk relative to a robot's physical capabilities, a challenge that varies across embodiments. We present CATNAV, a cost-aware traversability navigation framework that leverages multimodal LLMs for zero-shot, embodiment-aware costmap generation without task-specific training. We introduce a visuosemantic caching mechanism that detects scene novelty and reuses prior risk assessments for semantically similar frames, reducing online VLM queries by 85.7%. Furthermore, we introduce a VLM-based trajectory selection module that evaluates proposals through visual reasoning to choose the safest path given behavioral constraints. We evaluate CATNAV on a quadruped robot across indoor and outdoor unstructured environments, comparing against state-of-the-art vision-language-action baselines. Across five navigation tasks, CATNAV achieves a 10-percentage-point higher average goal-reaching rate and 33% fewer behavioral constraint violations.

Paper Structure (28 sections, 3 equations, 6 figures, 3 tables)

Introduction
Related Work
Traversability Estimation
Vision-Language Model assisted Navigation
Vision-Language-Action Models for Navigation
Method
Scene Perception and Risk Inference
Cost Estimation
Novelty Detection & Risk-Score Caching
Costmap Construction
Open-Vocabulary Segmentation and Cost Projection
Risk-Scored Point Cloud Generation
2D Occupancy Costmap
Goal Specification
Trajectory Generation
...and 13 more sections

Figures (6)

Figure 1: We present CATNAV, a zero-shot, cost- and embodiment-aware traversability navigation method that navigates unknown environments without fine-tuning.
Figure 2: Overview of the CATNAV pipeline for robot navigation. The system integrates real-time scene perception with risk inference, using a Novelty Check module to determine whether a scene requires a new VLM-based costmap prediction or can be processed with cached risks. Following Costmap Construction and Multi-Proposal TRRT Planning, the Trajectory Reasoning module employs a Large Language Model (LLM) to evaluate the proposed paths against specific robot modalities and required behaviors, ultimately selecting a single trajectory for execution.
Figure 3: Sample scenarios: (a) outdoor footpath navigation (Tasks 1--2), (b) outdoor navigation with obstacles (Task 4), (c) dynamic human crossing (Task 3), (d) indoor paper avoidance (Task 5).
Figure 4: Distribution of VLM query frequencies across different caching configurations. The histograms represent the raw query counts, overlaid with Gaussian approximations scaled to the total number of samples and bin width for each test. A reference fixed frequency is plotted based on the fixed polling baseline.
Figure 5: Qualitative results of the costmap segmentation using CATNAV's visuosemantic cache. The cost table is aggregated from the $k$-nearest neighbors of the current scene's CLIP embedding. If no neighbors are found within the novelty threshold $\gamma$, the scene is classified as "Novel," queried via the LLM, and appended to the vector store.
...and 1 more figures
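
The novelty check described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `RiskCache`, the callback `query_vlm`, the use of cosine distance, and per-class averaging as the aggregation rule are all hypothetical choices made here, and the two-dimensional vectors stand in for real CLIP embeddings.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
CostTable = Dict[str, float]  # semantic class -> risk score


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


class RiskCache:
    """Hypothetical cache: reuse risk tables for scenes close to stored ones."""

    def __init__(self, gamma: float, k: int,
                 query_vlm: Callable[[object], CostTable]) -> None:
        self.gamma = gamma          # novelty threshold on embedding distance
        self.k = k                  # number of nearest neighbors to aggregate
        self.query_vlm = query_vlm  # assumed callback wrapping the VLM query
        self.entries: List[Tuple[Vector, CostTable]] = []
        self.vlm_queries = 0

    def get_costs(self, embedding: Vector, frame: object) -> Tuple[CostTable, bool]:
        """Return (cost table, is_novel) for the current scene embedding."""
        scored = sorted(
            ((cosine_distance(embedding, emb), table) for emb, table in self.entries),
            key=lambda pair: pair[0],
        )
        neighbors = [table for dist, table in scored[: self.k] if dist <= self.gamma]
        if not neighbors:
            # Novel scene: query the model and append the result to the store.
            table = self.query_vlm(frame)
            self.vlm_queries += 1
            self.entries.append((list(embedding), table))
            return table, True
        # Known scene: average each class over the neighbors that mention it
        # (the averaging rule is an assumption made for this sketch).
        sums: Dict[str, float] = {}
        counts: Dict[str, int] = {}
        for table in neighbors:
            for label, risk in table.items():
                sums[label] = sums.get(label, 0.0) + risk
                counts[label] = counts.get(label, 0) + 1
        return {label: sums[label] / counts[label] for label in sums}, False


def fake_vlm(frame: object) -> CostTable:
    """Stand-in for a real VLM call; returns a fixed, made-up risk table."""
    return {"grass": 0.3, "pavement": 0.1}


cache = RiskCache(gamma=0.1, k=3, query_vlm=fake_vlm)
costs_0, novel_0 = cache.get_costs([1.0, 0.0], "frame-0")    # empty store
costs_1, novel_1 = cache.get_costs([0.99, 0.05], "frame-1")  # close to frame-0
costs_2, novel_2 = cache.get_costs([0.0, 1.0], "frame-2")    # far from frame-0
```

In this toy run the second frame lies within the threshold of the first and is served from the cache, so only two of the three frames trigger the stand-in model call. The threshold and neighbor count used here are arbitrary illustration values, not settings reported in the paper.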
