MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu; Hao Fei; Yuhui Zhang; Liangming Pan; Qijun Huang; Qian Liu; Preslav Nakov; Min-Yen Kan; William Yang Wang; Mong-Li Lee; Wynne Hsu

TL;DR

MuSLR introduces a formal benchmark and tasks for multimodal symbolic logical reasoning, addressing the gap in rigorously integrating vision and language with formal logic. It presents MuSLR-Bench, a dataset of $1{,}093$ instances across $7$ domains with depths from $2$ to $9$ and ground-truth reasoning chains, grounded in real-world contexts. To advance this area, the authors propose LogiCAM, a modular framework with Premise Selection, Reasoning Type Identification, and a Symbolic Reasoner, achieving substantial gains over base VLMs (e.g., a $14.13\%$ improvement when combined with GPT-4.1) and particularly excelling on complex logics like FOL. Across extensive analyses, cross-modal grounding errors dominate, highlighting the need for tighter multimodal fusion and logic-grounded training objectives to enable robust, transparent symbolic inference in multimodal systems.

Abstract

Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, where rigorous, deterministic reasoning helps prevent serious consequences. To evaluate this capability in current state-of-the-art vision language models (VLMs), we introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, covering 35 atomic symbolic logic rules and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. We therefore propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13% and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
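To make the idea of a modular pipeline with premise selection, reasoning type identification, and a symbolic reasoner more concrete, the following is a minimal toy sketch in Python. It is not the LogiCAM implementation: the `Rule` class, the function names, the propositional Horn-rule format, and the traffic example are all hypothetical, and the "visual facts" are assumed to have already been extracted from an image by some upstream model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical Horn-style rule: if all antecedents hold, so does the consequent."""
    antecedents: tuple
    consequent: str


def select_premises(rules, facts, goal):
    """Toy premise selection: keep only rules and facts reachable backwards from the goal."""
    relevant, frontier = [], {goal}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.consequent in frontier and rule not in relevant:
                relevant.append(rule)
                frontier.update(rule.antecedents)
                changed = True
    return relevant, {f for f in facts if f in frontier}


def identify_reasoning_type(rules, facts):
    """Toy type identification: 'symbolic' if some rule can fire on the facts, else 'none'."""
    can_fire = any(all(a in facts for a in r.antecedents) for r in rules)
    return "symbolic" if can_fire else "none"


def forward_chain(rules, facts, goal):
    """Toy symbolic reasoner: apply modus ponens until the goal is derived or nothing fires."""
    known, chain = set(facts), []
    while goal not in known:
        fired = next(
            (r for r in rules
             if r.consequent not in known and all(a in known for a in r.antecedents)),
            None,
        )
        if fired is None:
            return False, chain
        known.add(fired.consequent)
        chain.append(fired)
    return True, chain


def answer(rules, facts, goal):
    """Run the three toy stages in order and return (proved, reasoning_chain)."""
    selected_rules, selected_facts = select_premises(rules, facts, goal)
    if identify_reasoning_type(selected_rules, selected_facts) != "symbolic":
        return False, []
    return forward_chain(selected_rules, selected_facts, goal)


# Assumed inputs: facts stand in for visual grounding, rules stand in for textual premises.
FACTS = {"red_light", "pedestrian_present", "sunny"}
RULES = [
    Rule(("red_light",), "must_stop"),
    Rule(("must_stop", "pedestrian_present"), "vehicle_yields"),
    Rule(("rain",), "wet_road"),
]

if __name__ == "__main__":
    proved, chain = answer(RULES, FACTS, "vehicle_yields")
    print(proved, [r.consequent for r in chain])
```

In this toy run the irrelevant fact `sunny` and the rule about `rain` are filtered out during premise selection, and the goal is derived in two rule applications, which corresponds to a reasoning depth of 2 in the sense used above. A real system would have to handle first-order quantifiers and noisy visual grounding, which this sketch deliberately ignores.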
