MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu

TL;DR

MuSLR introduces a formal benchmark and tasks for multimodal symbolic logical reasoning, addressing the gap in rigorously integrating vision and language with formal logic. It presents MuSLR-Bench, a dataset of 1,093 instances across 7 domains with reasoning depths from 2 to 9 and ground-truth reasoning chains, grounded in real-world contexts. To advance this area, the authors propose LogiCAM, a modular framework with Premise Selection, Reasoning Type Identification, and a Symbolic Reasoner, achieving substantial gains over base VLMs (e.g., a 14.13% improvement when combined with GPT-4.1) and particularly excelling on complex logics like FOL. Across extensive analyses, cross-modal grounding errors dominate, highlighting the need for tighter multimodal fusion and logic-grounded training objectives to enable robust, transparent symbolic inference in multimodal systems.

Abstract

Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic rules and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
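
To make "deducing new facts via formal logic" and "reasoning depth" concrete, the following is a minimal sketch of forward chaining with modus ponens over propositional rules. It is not the authors' LogiCAM implementation or the MuSLR data format: the atom names, the rule set, and the convention that depth counts chained rule applications are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# A rule is (antecedent atoms, consequent atom). All names are hypothetical.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Dict[str, int], rules: List[Rule]) -> Dict[str, int]:
    """Apply modus ponens repeatedly until no new atom can be derived.

    `facts` maps each premise atom to depth 0. The result maps every
    derivable atom to the depth of the derivation found for it, where
    depth = 1 + the deepest antecedent (an assumed convention).
    """
    known = dict(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if consequent in known:
                continue  # already derived; keep the first derivation found
            if all(a in known for a in antecedents):
                deepest = max((known[a] for a in antecedents), default=0)
                known[consequent] = deepest + 1
                changed = True
    return known


# Hypothetical premises: one grounded in the image, one in the text.
premises = {"red_light_visible": 0, "car_approaching": 0}

# Hypothetical rules forming a two-step chain.
rules: List[Rule] = [
    (frozenset({"red_light_visible", "car_approaching"}), "must_stop"),
    (frozenset({"must_stop"}), "should_brake"),
]

derived = forward_chain(premises, rules)
print(derived["must_stop"], derived["should_brake"])
```

In this toy setup, the conclusion `should_brake` requires combining a visual premise with a textual one and then chaining a second rule, which is the kind of cross-modal, multi-step inference the benchmark is described as testing.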

Paper Structure

This paper contains 53 sections, 32 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: An example of a depth-4 propositional logic task, requiring the VLMs to apply formal symbolic logic rules and integrate multimodalities to reach the conclusion.
  • Figure 2: Pipeline of MuSLR data construction. We begin by collecting multimodal data and symbolic rules. These rules are then combined to form reasoning chains, which are grounded in real-world contexts to generate questions and answers, followed by a strict quality check.
  • Figure 3: Dataset Statistics. The left table presents general dataset statistics. The middle pie chart illustrates the distribution across domains and symbolic logic. The right bar charts display the number of instances by reasoning depth and data source.
  • Figure 4: LogiCAM Workflow. The figure illustrates a single iteration; the complete multi-iteration reasoning process is detailed in the case study (\ref{fig:case_study}).
  • Figure 5: Accuracy of symbolic logic
  • ...and 5 more figures