Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models
Sanchit Sinha, Guangzhi Xiong, Zhenghao He, Aidong Zhang
TL;DR
Concept-RuleNet introduces a grounded neurosymbolic framework that couples a visual-concept extraction agent, a symbol-exploration agent, and a verifier to produce human-interpretable rules for vision-language model predictions. By grounding symbol generation in training-image concepts rather than relying solely on task labels, it reduces symbol hallucinations and improves generalization to out-of-distribution data, achieving an average improvement of about 5% over state-of-the-art baselines across five diverse datasets. The CRN++ variant further incorporates counterfactual symbols to widen decision boundaries and improve accuracy. This approach enhances the interpretability and robustness of VLMs, with practical impact in medical imaging and remote sensing, where reliable reasoning pathways are critical.
Abstract
Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into 'why' a decision is reached, and they frequently hallucinate facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce Concept-RuleNet, a multi-agent system that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are used to condition symbol discovery, anchoring the generated symbols in real image statistics and mitigating label bias. Subsequently, a large language model reasoner agent composes the symbols into executable first-order rules, yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with the outputs of black-box neural models, producing predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system augments state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.
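The inference step described above (verifier scores per symbol, rule execution, combination with a black-box model) can be illustrated with a small sketch. This is not the authors' implementation: the `Rule` class, the min/max fuzzy logic, the mixing weight `alpha`, and all symbol names and scores are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical conjunctive rule: all symbols present => label."""
    symbols: tuple
    label: str


def rule_score(rule, presence):
    # Assumed fuzzy AND (minimum) over verifier scores in [0, 1];
    # a symbol the verifier did not score counts as absent (0.0).
    return min(presence.get(s, 0.0) for s in rule.symbols)


def symbolic_scores(rules, presence, labels):
    # Assumed fuzzy OR (maximum) over all rules that conclude each label.
    scores = {label: 0.0 for label in labels}
    for rule in rules:
        if rule.label in scores:
            scores[rule.label] = max(scores[rule.label], rule_score(rule, presence))
    return scores


def predict(rules, presence, neural_probs, alpha=0.5):
    # Mix black-box probabilities with rule scores; alpha is an
    # illustrative weight, not a value taken from the paper.
    labels = list(neural_probs)
    sym = symbolic_scores(rules, presence, labels)
    combined = {l: alpha * neural_probs[l] + (1 - alpha) * sym[l] for l in labels}
    best = max(combined, key=combined.get)
    # The rules that fired for the winning label act as the explanation.
    fired = [r for r in rules if r.label == best and rule_score(r, presence) > 0]
    return best, combined, fired


# Made-up example data (not from the paper).
rules = [
    Rule(("striped_fur", "four_legs"), "zebra"),
    Rule(("mane", "four_legs"), "horse"),
]
presence = {"striped_fur": 0.9, "four_legs": 0.8, "mane": 0.3}  # verifier output
neural_probs = {"zebra": 0.4, "horse": 0.6}  # black-box model output

best, combined, fired = predict(rules, presence, neural_probs)
print(best, combined, [r.symbols for r in fired])
```

In this toy case the black-box model alone prefers "horse", but the strongly verified "striped_fur" and "four_legs" symbols shift the combined decision to "zebra", and the rule that fired serves as the reasoning pathway for that prediction.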
