GRIP: A Unified Framework for Grid-Based Relay and Co-Occurrence-Aware Planning in Dynamic Environments

Ahmed Alanazi, Duy Ho, Yugyung Lee

TL;DR

GRIP is a unified framework for grid-based relay and co-occurrence-aware planning in dynamic environments that integrates dynamic semantic mapping, symbolic reasoning, and LLM-guided introspection. It introduces a Dynamic Open-Vocabulary Scene Graph (DovSG), a co-occurrence knowledge graph for relay anchoring, a semantic occupancy grid for planning, and a hierarchical loop that blends transformer-based subgoal prediction with D* replanning. The paper evaluates three variants (GRIP-L, GRIP-F, and GRIP-R) covering simulation, simulation with occlusion, and real-world deployment on a Jetbot Pro. On AI2-THOR and RoboTHOR, GRIP reports up to 9.6% higher success rate (SR) and over 2× better path efficiency (SPL and SAE) on long-horizon tasks, and the real-world deployment supports these findings. The results highlight GRIP's generality, interpretability, and sim-to-real transfer, with symbolic introspection and recovery enabling robust navigation under occlusion and semantic ambiguity. Collectively, GRIP advances open-vocabulary grounding, symbolic planning, and real-world embodied AI by delivering a scalable, explainable, and adaptable navigation framework.

Abstract

Robots navigating dynamic, cluttered, and semantically complex environments must integrate perception, symbolic reasoning, and spatial planning to generalize across diverse layouts and object categories. Existing methods often rely on static priors or limited memory, constraining adaptability under partial observability and semantic ambiguity. We present GRIP, Grid-based Relay with Intermediate Planning, a unified, modular framework with three scalable variants: GRIP-L (Lightweight), optimized for symbolic navigation via semantic occupancy grids; GRIP-F (Full), supporting multi-hop anchor chaining and LLM-based introspection; and GRIP-R (Real-World), enabling physical robot deployment under perceptual uncertainty. GRIP integrates dynamic 2D grid construction, open-vocabulary object grounding, co-occurrence-aware symbolic planning, and hybrid policy execution using behavioral cloning, D* search, and grid-conditioned control. Empirical results on AI2-THOR and RoboTHOR benchmarks show that GRIP achieves up to 9.6% higher success rates and over $2\times$ improvement in path efficiency (SPL and SAE) on long-horizon tasks. Qualitative analyses reveal interpretable symbolic plans in ambiguous scenes. Real-world deployment on a Jetbot further validates GRIP's generalization under sensor noise and environmental variation. These results position GRIP as a robust, scalable, and explainable framework bridging simulation and real-world navigation.
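
The abstract reports path efficiency through SPL without defining it. As a minimal sketch, assuming SPL here is the standard Success weighted by Path Length metric from the embodied-navigation literature (the paper's own evaluation code is not shown, and the function and argument names below are hypothetical), the metric can be computed per batch of episodes as follows. SAE is not covered because its definition is not given in this excerpt.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        taken_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Assumed standard definition (not taken from the GRIP paper):
        SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)
    where S_i is 1 on success, l_i is the shortest-path length,
    and p_i is the length of the path the agent actually took.
    """
    n = len(successes)
    if n == 0 or n != len(shortest_lengths) or n != len(taken_lengths):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, taken_lengths):
        if not success:
            continue  # failed episodes contribute zero
        denom = max(taken, shortest)
        # An episode that starts at the goal has zero path length; count it as optimal.
        total += shortest / denom if denom > 0 else 1.0
    return total / n


# Three episodes: an optimal success, a success with a 2x detour, and a failure.
score = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(f"SPL = {score:.3f}")
```

Because failed episodes score zero and detours are penalized proportionally, an agent can raise SPL either by succeeding more often or by taking shorter paths, which is why the abstract reports it alongside the raw success rate.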

Paper Structure

This paper contains 69 sections, 22 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: GRIP-F Architecture. The full symbolic configuration integrates RGB(-D) input, dynamic memory, grid-based planning, symbolic chaining, and LLM recovery, supporting robust navigation under occlusion and ambiguity.
  • Figure 2: GRIP-R real-world execution. Symbolic reasoning enables robust navigation under uncertainty.
  • Figure 3: Failure Cases: GRIP-F fails due to semantic-spatial mismatches. Anchors were semantically relevant but spatially unreachable or misleading.
  • Figure 4: Object-Wise Comparison of SR, SPL, and SAE. Top: GRIP-L performance in AI2-THOR. Bottom: GRIP-F performance in RoboTHOR. Metrics are reported per object class for baseline and L5-enhanced configurations.
  • Figure 5: Success Cases: GRIP-F reaches targets using symbolic anchor sequences despite visual ambiguity.