A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics

Simindokht Jahangard, Mehrzad Mohammadi, Abhinav Dhall, Hamid Rezatofighi

TL;DR

The paper tackles the difficulty of fine-grained spatial reasoning in robotics by introducing a lightweight neuro-symbolic framework that fuses panoramic RGB-D imagery with 3D point clouds and grounds reasoning on an explicit scene graph. The perception module detects entities and extracts attributes with Florence and InternVL3.5, while the projection module fuses semantic features with 3D geometry to form a geometry-aware graph. A graph-search module performs query-driven reasoning through a two-phase Hybrid Attribute–Relational Filtering algorithm, enabling accurate visual grounding and VQA with interpretable relations. Demonstrated on JRDB-Reasoning, the approach achieves superior performance with only about 1.3B parameters, outperforming larger VLM baselines in complex relational queries and suggesting strong potential for real-world embodied AI applications.
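The summary above does not spell out the two phases of Hybrid Attribute–Relational Filtering. One plausible reading, sketched below in Python, is that candidate nodes are first narrowed by attribute match and then kept only if they hold the queried relation to an anchor node. The class and function names (`SceneGraph`, `hybrid_filter`), the attribute keys, and the relation labels are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # node_id -> attribute dict, e.g. {"gender": "female", "clothing": "red"}
    nodes: dict = field(default_factory=dict)
    # (src_id, dst_id) -> set of relation labels, e.g. {"left_of"}
    edges: dict = field(default_factory=dict)

    def add_node(self, node_id, **attributes):
        self.nodes[node_id] = attributes

    def add_edge(self, src, dst, relation):
        self.edges.setdefault((src, dst), set()).add(relation)


def matches(attributes, wanted):
    """True if every wanted key/value pair is present in the node's attributes."""
    return all(attributes.get(key) == value for key, value in wanted.items())


def hybrid_filter(graph, target_attrs, relation, anchor_attrs):
    """Assumed two-phase search: attribute filtering, then relational filtering."""
    # Phase 1 (assumption): keep only nodes whose attributes match the query.
    targets = [n for n, a in graph.nodes.items() if matches(a, target_attrs)]
    anchors = [n for n, a in graph.nodes.items() if matches(a, anchor_attrs)]

    # Phase 2 (assumption): keep targets that hold `relation` to some anchor.
    grounded = []
    for t in targets:
        if any(t != a and relation in graph.edges.get((t, a), set()) for a in anchors):
            grounded.append(t)
    return grounded


# Toy scene: "the person in red who is left of the man".
graph = SceneGraph()
graph.add_node(0, gender="female", clothing="red")
graph.add_node(1, gender="male", clothing="blue")
graph.add_node(2, gender="female", clothing="red")
graph.add_edge(0, 1, "left_of")
graph.add_edge(2, 1, "right_of")

result = hybrid_filter(graph, {"clothing": "red"}, "left_of", {"gender": "male"})
print(result)
```

In this toy scene, nodes 0 and 2 both pass the attribute phase, but only node 0 survives the relational phase. Because every step is an explicit set operation, the reason a node was kept or rejected can be read directly from the graph, which is the kind of interpretability the summary attributes to symbolic search.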

Abstract

Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro-symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.

Paper Structure

This paper contains 10 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Schematic of the proposed framework: image, point cloud, and query serve as inputs. Feature extraction and projection generate a graph from attributes and edges, and the graph search module derives the final answer.
  • Figure 2: The overall framework of the model comprises a perception part—consisting of the Feature Extraction Module ($\mathcal{F}_E$) and Projection Module ($\mathcal{F}_P$) for semantic and geometric information extraction—and a reasoning part, the Graph Search Module ($\mathcal{F}_G$), which enables query-driven, interpretable spatial understanding.
  • Figure 3: Search Algorithm
  • Figure 4: A sample of our framework’s pipeline: the stitched image is first processed by Florence to detect humans, and their bounding boxes are passed to InternVL for attribute extraction and graph node construction. Outputs from Florence and the point cloud are integrated to derive spatial relations, and the resulting graph is searched based on the provided symbols.
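
The Figure 4 caption says spatial relations are derived by integrating detector outputs with the point cloud, but it gives no formulas. The Python sketch below shows one simple way such edges could be computed. It assumes each detected person has already been assigned the 3D points that fall inside their bounding box, with the sensor at the origin. The relation labels (`near`, `far`, `closer_to_robot`) and the 2.0 m threshold are illustrative assumptions, not the paper's definitions.

```python
import math


def centroid(points):
    """Mean of a non-empty list of (x, y, z) points."""
    count = len(points)
    return tuple(sum(p[axis] for p in points) / count for axis in range(3))


def spatial_relations(centroids, near_threshold=2.0):
    """Build directed relation edges between every ordered pair of entities.

    centroids: dict mapping entity id -> (x, y, z) in the sensor frame.
    near_threshold: hypothetical distance cut-off in metres.
    """
    edges = {}
    for a in centroids:
        for b in centroids:
            if a == b:
                continue
            distance = math.dist(centroids[a], centroids[b])
            labels = {"near" if distance <= near_threshold else "far"}
            # Range from the sensor origin decides which entity is closer.
            if math.hypot(*centroids[a]) < math.hypot(*centroids[b]):
                labels.add("closer_to_robot")
            edges[(a, b)] = labels
    return edges


# Two people, each with a few 3D points assumed to lie inside their boxes.
points_per_person = {
    0: [(1.0, 0.0, 0.0), (1.0, 0.2, 0.0), (1.0, -0.2, 0.0)],
    1: [(4.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
}
centroids = {pid: centroid(pts) for pid, pts in points_per_person.items()}
edges = spatial_relations(centroids)
print(edges[(0, 1)], edges[(1, 0)])
```

Edges produced this way could be attached to the attribute nodes to form a geometry-aware graph of the kind the figure describes.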