Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

Sanchit Sinha, Guangzhi Xiong, Zhenghao He, Aidong Zhang

TL;DR

Concept-RuleNet introduces a grounded neurosymbolic framework that couples a visual-concept extraction agent, a symbol-exploration agent, and a verifier to produce human-interpretable rules for vision-language model predictions. By grounding symbol generation in training-image concepts rather than relying solely on task labels, it reduces symbol hallucinations and improves generalization to out-of-distribution data, achieving around a 5% average improvement over state-of-the-art baselines across five diverse datasets. The CRN++ variant further incorporates counterfactual symbols to widen decision boundaries and improve accuracy. This approach enhances interpretability and robustness of VLMs, with practical impact in medical imaging and remote sensing where reliable reasoning pathways are critical.

Abstract

Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into 'why' a decision is reached, and they frequently hallucinate facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce a multi-agent system, Concept-RuleNet, that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are used to condition symbol discovery, anchoring the generations in real image statistics and mitigating label bias. Subsequently, a large language model reasoner agent composes the symbols into executable first-order rules, yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with the outputs of black-box neural models, producing predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system improves on state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.
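The inference step described in the abstract (verifier scores per symbol, rule execution, combination with the black-box output) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Rule` class, the helper names, the product used as a soft AND, the max used across rules of one class, the linear mixing `weight`, and all symbol names and numeric values except the 0.48 VLM probability for the correct class, which the Figure 3 caption reports.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Rule:
    """Hypothetical conjunctive rule: symbol_1 AND ... AND symbol_n => label."""
    symbols: tuple[str, ...]
    label: str


def rule_score(rule: Rule, presence: dict[str, float]) -> float:
    # Soft AND via the product of presence scores (an assumed choice).
    # Symbols the verifier did not score are treated as absent (0.0).
    return prod(presence.get(s, 0.0) for s in rule.symbols)


def symbolic_scores(rules, presence, labels) -> dict[str, float]:
    # Per class, keep the strongest-firing rule (an assumed soft OR via max).
    scores = {label: 0.0 for label in labels}
    for rule in rules:
        scores[rule.label] = max(scores[rule.label], rule_score(rule, presence))
    return scores


def predict(vlm_probs, rules, presence, weight=0.5):
    """Mix black-box VLM probabilities with rule scores (assumed linear mix)."""
    sym = symbolic_scores(rules, presence, vlm_probs)
    combined = {
        label: (1 - weight) * vlm_probs[label] + weight * sym[label]
        for label in vlm_probs
    }
    best = max(combined, key=combined.get)
    # Rules for the predicted class, strongest first, serve as the explanation.
    fired = sorted(
        (r for r in rules if r.label == best),
        key=lambda r: rule_score(r, presence),
        reverse=True,
    )
    return best, combined, fired


# Toy inputs: only the 0.48 for the correct class comes from the Figure 3 caption.
vlm_probs = {"basophil": 0.48, "eosinophil": 0.52}
presence = {  # hypothetical verifier outputs in [0, 1]
    "dark purple granules": 0.9,
    "lobed nucleus": 0.8,
    "orange-pink granules": 0.2,
}
rules = [
    Rule(("dark purple granules", "lobed nucleus"), "basophil"),
    Rule(("orange-pink granules",), "eosinophil"),
]

label, combined, fired = predict(vlm_probs, rules, presence)
print(label, [r.symbols for r in fired])
```

In this toy setting the VLM alone slightly prefers the wrong class, while the rule whose symbols the verifier finds in the image shifts the combined score toward 'basophil' and supplies a readable justification. The actual scoring and combination functions used by Concept-RuleNet may differ.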

Paper Structure

This paper contains 29 sections, 15 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Examples sampled from the BloodMNIST (Yang et al., 2023) and UC-Merced Land Use (Yang and Newsam, 2010) test datasets, showing a sample from the 'Basophil' and 'Agriculture Land Pattern' classes, respectively. We list the top rules influencing the decision-making process. (TOP) We observe that using no images during the symbolic rule generation process (Symbol-LLM) generates rules with non-grounded symbols, i.e., symbols NOT present in test images (hallucinations). (BOTTOM) We observe that the generated rules are often somewhat semantically related to the task label but not representative of the task. The highlighted symbols are most relevant for prediction, with red symbols being hallucinated, green being appropriate, and yellow being non-representative.
  • Figure 2: Schematic of the Concept-RuleNet approach. Concept-RuleNet operates in three sequential stages: (i) Grounded Visual Concept Extraction: outputs visual concepts grounded in representative training images, (ii) Conditional Symbol Generation and Neurosymbolic Rule-based Predictions: explores relevant symbols and composes them into logical rules, and (iii) Neurosymbolic Rule-based Predictions: verifies the presence of each symbol in the rule to provide final predictions.
  • Figure 3: Inference process for a sample from the 'Basophil' class of the BloodMNIST dataset. The VLM inference output assigns a probability score of $0.48$ to the correct class.
  • Figure 4: (a) Symbol grounding and (b) rule length–accuracy tradeoff.
  • Figure 5: (LEFT) Concept-RuleNet generated visual concepts, symbols, and rules. (RIGHT) Symbol-LLM generated symbols and rules for the 'denseresidential' class in the satellite dataset. We observe that Symbol-LLM outputs multiple symbols that have a high probability of being present (shown as bold numbers in the rules) but are non-representative of the task, which is classifying images into the 'denseresidential' category. For example, 'high-demand for housing' is an irrelevant symbol for this task.
  • ...and 5 more figures