A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics

Simindokht Jahangard; Mehrzad Mohammadi; Abhinav Dhall; Hamid Rezatofighi

TL;DR

The paper tackles the difficulty of fine-grained spatial reasoning in robotics by introducing a lightweight neuro-symbolic framework that fuses panoramic RGB-D imagery with 3D point clouds and grounds reasoning on an explicit scene graph. The perception module detects entities and extracts attributes with Florence and InternVL3.5, while the projection module fuses semantic features with 3D geometry to form a geometry-aware graph. A graph-search module performs query-driven reasoning through a two-phase Hybrid Attribute–Relational Filtering algorithm, enabling accurate visual grounding and VQA with interpretable relations. Demonstrated on JRDB-Reasoning, the approach achieves superior performance with only about 1.3B parameters, outperforming larger VLM baselines in complex relational queries and suggesting strong potential for real-world embodied AI applications.

Abstract

Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning because their reasoning is implicit and correlation-driven and relies solely on images. We propose a novel neuro-symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
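
To make the idea of querying an explicit scene graph concrete, the following is a minimal Python sketch, loosely modeled on the two-phase attribute-then-relation filtering mentioned in the TL;DR. It is not the authors' implementation: the `Entity` schema, the attribute names, the single `near` relation, and the 1.5 m distance threshold are all assumptions chosen for illustration, and the entities are hand-written rather than produced by a perception model.

```python
import math
from dataclasses import dataclass


@dataclass
class Entity:
    """A scene-graph node (hypothetical schema, not taken from the paper)."""
    name: str
    category: str
    attributes: dict
    position: tuple  # assumed (x, y, z) centroid in metres from the point cloud


def build_relations(entities, near_threshold=1.5):
    """Derive symbolic 'near' edges from 3D centroids (threshold is an assumption)."""
    edges = set()
    for a in entities:
        for b in entities:
            if a is not b and math.dist(a.position, b.position) <= near_threshold:
                edges.add((a.name, "near", b.name))
    return edges


def attribute_filter(entities, category, **attrs):
    """Phase 1: keep entities whose category and attributes match the query."""
    return [
        e for e in entities
        if e.category == category
        and all(e.attributes.get(k) == v for k, v in attrs.items())
    ]


def relational_filter(candidates, anchors, relation, edges):
    """Phase 2: keep candidates that hold `relation` to at least one anchor."""
    anchor_names = {a.name for a in anchors}
    return [
        c for c in candidates
        if any((c.name, relation, n) in edges for n in anchor_names)
    ]


def ground(entities, edges, target, relation, anchor):
    """Answer a query of the form '<target> <relation> <anchor>'."""
    candidates = attribute_filter(entities, **target)
    anchors = attribute_filter(entities, **anchor)
    return relational_filter(candidates, anchors, relation, edges)


# Hand-written toy scene standing in for perception output.
scene = [
    Entity("p1", "person", {"upper_clothing": "red"}, (1.0, 0.0, 0.0)),
    Entity("p2", "person", {"upper_clothing": "red"}, (6.0, 2.0, 0.0)),
    Entity("p3", "person", {"upper_clothing": "blue"}, (1.2, 0.5, 0.0)),
    Entity("t1", "table", {}, (1.5, 0.5, 0.0)),
]
edges = build_relations(scene)

# Query: "the person wearing red near a table".
result = ground(
    scene,
    edges,
    target={"category": "person", "upper_clothing": "red"},
    relation="near",
    anchor={"category": "table"},
)
print([e.name for e in result])
```

In this toy scene, the attribute phase discards the person in blue and the relational phase discards the red-clad person standing far from the table, so every step that led to the answer can be inspected. That traceability is the kind of interpretability the abstract attributes to symbolic reasoning over a graph.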

Paper Structure

Table of Contents

Figures (4)