
CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation

Aditya Potnis, Francisco Affonso, Shreya Gummadi, Naveen Kumar Uppalapati, Girish Chowdhary

Abstract

Navigating unstructured environments requires assessing traversal risk relative to a robot's physical capabilities, a challenge that varies across embodiments. We present CATNAV, a cost-aware traversability navigation framework that leverages multimodal LLMs for zero-shot, embodiment-aware costmap generation without task-specific training. We introduce a visuosemantic caching mechanism that detects scene novelty and reuses prior risk assessments for semantically similar frames, reducing online VLM queries by 85.7%. Furthermore, we introduce a VLM-based trajectory selection module that evaluates proposals through visual reasoning to choose the safest path given behavioral constraints. We evaluate CATNAV on a quadruped robot across indoor and outdoor unstructured environments, comparing against state-of-the-art vision-language-action baselines. Across five navigation tasks, CATNAV achieves a 10-percentage-point higher average goal-reaching rate and 33% fewer behavioral constraint violations.

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: We present CATNAV, a zero-shot, cost- and embodiment-aware traversability navigation method that navigates unknown environments without fine-tuning.
  • Figure 2: Overview of the CATNAV pipeline for robot navigation. The system integrates real-time scene perception with risk inference, using a Novelty Check module to determine whether a scene requires a new VLM-based costmap prediction or can be processed from cached risks. Following Costmap Construction and Multi-Proposal TRRT Planning, the Trajectory Reasoning module employs a Large Language Model (LLM) to evaluate the proposed paths against specific robot modalities and required behaviors, ultimately selecting a single trajectory for execution.
  • Figure 3: Sample scenarios: (a) outdoor footpath navigation (Tasks 1–2), (b) outdoor navigation with obstacles (Task 4), (c) dynamic human crossing (Task 3), (d) indoor paper avoidance (Task 5).
  • Figure 4: Distribution of VLM query frequencies across different caching configurations. The histograms show the raw query counts, overlaid with Gaussian approximations scaled by the total number of samples and the bin width for each test. A fixed reference frequency, taken from the fixed-polling baseline, is also plotted.
  • Figure 5: Qualitative results of the costmap segmentation using CATNAV's visuosemantic cache. The cost table is aggregated from the $k$-nearest neighbors of the current scene's CLIP embedding. If no neighbors are found within the novelty threshold $\gamma$, the scene is classified as "Novel," queried via the LLM, and appended to the vector store.
  • ...and 1 more figure
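
The novelty check and cache reuse described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `VisuoSemanticCache`, the helper `assess`, the use of cosine distance as the novelty metric, and the per-class mean used to aggregate neighbor cost tables are all assumptions made for illustration. The embeddings stand in for CLIP features, and `query_vlm` is a hypothetical callback that returns a cost table for a novel scene.

```python
import math
from dataclasses import dataclass, field


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed novelty metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


@dataclass
class VisuoSemanticCache:
    """Hypothetical vector store of (embedding, cost_table) pairs."""
    gamma: float  # novelty threshold on the distance to cached scenes
    k: int = 3    # number of nearest neighbors to aggregate
    entries: list = field(default_factory=list)

    def lookup(self, embedding):
        """Return (cost_table, is_novel) for the given scene embedding."""
        scored = sorted(
            ((cosine_distance(embedding, emb), costs) for emb, costs in self.entries),
            key=lambda pair: pair[0],
        )
        # Keep only the k nearest neighbors that lie within the threshold.
        neighbors = [costs for dist, costs in scored[: self.k] if dist <= self.gamma]
        if not neighbors:
            return None, True  # no close neighbor: the scene is "Novel"
        # Assumed aggregation: mean cost per semantic class over the neighbors.
        totals, counts = {}, {}
        for costs in neighbors:
            for label, cost in costs.items():
                totals[label] = totals.get(label, 0.0) + cost
                counts[label] = counts.get(label, 0) + 1
        return {label: totals[label] / counts[label] for label in totals}, False

    def insert(self, embedding, cost_table):
        """Append a newly assessed scene to the store."""
        self.entries.append((list(embedding), dict(cost_table)))


def assess(cache, embedding, query_vlm):
    """Reuse cached risks when possible; otherwise query the VLM and cache the result."""
    costs, novel = cache.lookup(embedding)
    if novel:
        costs = query_vlm()
        cache.insert(embedding, costs)
    return costs, novel
```

In this sketch a larger `gamma` makes the cache more permissive, so fewer scenes count as novel and fewer VLM queries are issued, at the risk of reusing cost tables from scenes that are less similar.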