Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding

Imran Kabir, Md Alimoor Reza, Syed Billah

TL;DR

Logic-RAG addresses the weak visual-spatial reasoning of large multimodal models (LMMs) used in autonomous driving by grounding their responses in a dynamic first-order logic (FOL) knowledge base (KB). It combines a perception module that extracts structured scene facts, a Query-to-Logic Embedder that translates natural-language questions into FOL predicates, and a symbolic Inference Engine that derives conclusions before the results are passed to the LMM, thereby reducing hallucination and increasing interpretability. The approach yields substantial accuracy gains on synthetic driving scenes ($\uparrow$ to $>80\%$) and real-world KITTI data ($\uparrow$ to $\approx90\%$), with ablations showing that both the fact-based KB context and full logical inference contribute to the improvement. The framework is modular and extensible: domain experts can augment the KB with new predicates and rules. By improving the spatial reasoning and explainability of model-assisted decisions, it shows practical potential for safer, more trustworthy autonomous driving systems.

Abstract

Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
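The flow described in the abstract (facts in a KB, a query predicate, symbolic inference, then a context string handed to the LMM) can be illustrated with a minimal sketch. This is not the authors' implementation: the predicate names, the two rules, and the `build_context` helper are assumptions chosen only for illustration, and the facts are hand-written stand-ins for what a perception module might produce.

```python
# Hypothetical facts standing in for perception-module output.
# Predicate and object names are illustrative assumptions, not taken from the paper.
FACTS = {
    ("Pedestrian", "ped01"),
    ("LeftOf", "ped01", "vehicle01"),
    ("Near", "ped01", "ego"),
}


def infer(facts):
    """Forward-chain two assumed rules until no new fact can be derived."""
    kb = set(facts)
    while True:
        new = set()
        for fact in kb:
            # Assumed rule 1: LeftOf(a, b) -> RightOf(b, a)
            if fact[0] == "LeftOf":
                new.add(("RightOf", fact[2], fact[1]))
            # Assumed rule 2: Pedestrian(x) and Near(x, ego) -> Caution(ego)
            if fact[0] == "Near" and fact[2] == "ego" and ("Pedestrian", fact[1]) in kb:
                new.add(("Caution", "ego"))
        if new <= kb:  # fixpoint reached: nothing new was derived
            return kb
        kb |= new


def build_context(question, query, kb):
    """Resolve a query predicate against the KB and format text for an LMM prompt."""
    verdict = query in kb
    args = ", ".join(query[1:])
    return f"Question: {question}\nLogic result: {query[0]}({args}) is {verdict}."


kb = infer(FACTS)
context = build_context(
    "Is the vehicle to the right of the pedestrian?",
    ("RightOf", "vehicle01", "ped01"),
    kb,
)
print(context)
```

In this sketch the LMM would receive `context` alongside the original question, so its answer can be grounded in an explicit, inspectable inference result rather than in its own reading of the frames.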

Paper Structure

This paper contains 30 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: (a) Components of the Logic-RAG framework. It takes $N$ frames and visual-spatial reasoning questions as inputs. The perception module analyzes the frames to generate facts about properties and visual-spatial relationships of objects in the scene and constructs a knowledge base (KB) in the form of First-Order Logic (FOL). The Query-to-Logic embedder parses the natural language question into an FOL query predicate, which is then passed to the Inference Engine that performs the query resolution. (b) The integration of Logic-RAG into a black box LMM, which receives the inference output of our framework while generating the response.
  • Figure 2: An illustration of the limitations of current commercial LMMs in visual-spatial reasoning (VSR). (Top) A sample of 4 consecutive frames from a synthetic driving video on which the VSR task is performed. (Middle) The original prompts and four representative questions. The values of N and M are 10 for GPT-4V and 5 for Claude-3.5, due to the models' current limits. (Bottom) Four responses from: 1) a human oracle (accuracy: 100%), 2) Logic-RAG, 3) GPT-4V, and 4) Claude-3.5. Correct answers are colored green and incorrect answers red. Note that Logic-RAG's accuracy is higher than that of the black-box LMMs, and its response is generated by logical inference in our FOL system.
  • Figure 3: The predicates (in black) show a portion of the KB that we construct using the output of the perception module for the frames in Fig. 2. Text in orange shows the queries that we make in the KB to learn the facts. For instance, for the frames in Fig. 2, ConstantSpeed(vehicle01) is true.
  • Figure 4: Internal block diagram of our perception module. It takes video frames as input and generates semantic, depth, and optical flow maps, which are then utilized to track object instances and estimate relative distances.
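The captions above mention that the perception module tracks objects and estimates relative distances, and that a fact such as ConstantSpeed(vehicle01) can hold in the KB. The sketch below shows one way such a fact could be derived from per-frame distance estimates. The criterion used here (finite-difference speeds whose spread stays within a tolerance), the function names, the frame rate, and the tolerance are all assumptions for illustration; the summary does not state how the paper defines this predicate.

```python
def speeds_from_distances(distances, fps):
    """Finite-difference speed between consecutive frames (distance units per second)."""
    return [abs(b - a) * fps for a, b in zip(distances, distances[1:])]


def is_constant_speed(distances, fps=10.0, tol=0.5):
    """Assumed criterion: the spread of per-frame speeds stays within `tol`."""
    speeds = speeds_from_distances(distances, fps)
    if not speeds:  # fewer than two frames: no speed can be estimated
        return False
    return max(speeds) - min(speeds) <= tol


def constant_speed_facts(tracks, fps=10.0, tol=0.5):
    """Emit a ConstantSpeed(<id>) fact string for every track that satisfies the criterion."""
    return [
        f"ConstantSpeed({obj_id})"
        for obj_id, distances in sorted(tracks.items())
        if is_constant_speed(distances, fps, tol)
    ]


# Hypothetical relative distances (in meters) for two tracked objects over 4 frames.
tracks = {
    "vehicle01": [20.0, 19.0, 18.0, 17.0],  # closes 1 m per frame: steady
    "vehicle02": [20.0, 19.0, 17.0, 14.0],  # closes faster each frame: not steady
}
facts = constant_speed_facts(tracks)
print(facts)
```

A tolerance-based test like this keeps the resulting fact robust to small depth-estimation noise, at the cost of a threshold that would need tuning for a real perception stack.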