Can VLMs Reason Robustly? A Neuro-Symbolic Investigation

Weixin Chen, Antonio Vergari, Han Zhao

Abstract

Vision-Language Models (VLMs) have been applied to a wide range of reasoning tasks, yet it remains unclear whether they can reason robustly under distribution shifts. In this paper, we study covariate shifts in which the perceptual input distribution changes while the underlying prediction rules do not. To investigate this question, we consider visual deductive reasoning tasks, where a model is required to answer a query given an image and logical rules defined over the object concepts in the image. Empirically, we find that VLMs fine-tuned through gradient-based end-to-end training can achieve high in-distribution accuracy but fail to generalize under such shifts, suggesting that fine-tuning does not reliably induce the underlying reasoning function. This motivates a neuro-symbolic perspective that decouples perception from reasoning. However, we further observe that recent neuro-symbolic approaches that rely on black-box components for reasoning can still exhibit inconsistent robustness across tasks. To address this issue, we propose VLC, a neuro-symbolic method that combines VLM-based concept recognition with circuit-based symbolic reasoning. In particular, task rules are compiled into a symbolic program, specifically a circuit, which executes the rules exactly over the object concepts recognized by the VLM. Experiments on three visual deductive reasoning tasks with distinct rule sets show that VLC consistently achieves strong performance under covariate shifts, highlighting its ability to support robust reasoning.

Paper Structure (46 sections, 3 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Left: Given an image containing multiple objects and natural-language rules about object concepts, the (fine-tuned) VLM, Prism, and ViperGPT are required to answer a query based on the image by reasoning with the given rules. Middle: VLC decouples perception from reasoning and infers the final answer by applying the symbolic rule compiled into the circuit. Right: Performance of different paradigms on datasets sharing the same reasoning function but differing in the number of objects per image. Here, the reasoning function is logical XOR (see Section \ref{sec:task} for the task definition), and the three datasets contain images with three, five, and seven handwritten binary digits, respectively. The models are required to output the XOR of these digits.
  • Figure 2: VLC consists of two phases: VLM-based concept recognition (yellow blocks) and circuit-based symbolic reasoning (green blocks). We use the pySDD compiler to compile the symbolic rules into a circuit, specifically an SDD. During inference, the VLM is prompted to recognize object concepts in the input image. The concepts are then extracted from the generated response and converted into binary inputs to the SDD. The SDD performs exact inference over these binary inputs using the compiled rules, and its binary outputs are converted into the final answer.
  • Figure 3: Comparison of concept accuracy and task accuracy for Prism and VLC across different datasets on the MNAdd task. While Prism and VLC achieve similar concept accuracy, VLC maintains a much smaller gap between concept accuracy and task accuracy, whereas this gap grows substantially for Prism as task complexity increases. This suggests that VLC ensures more robust reasoning under covariate shift.
  • Figure 4: (a) Effect of scaling up VLM size in End2end reasoning. (b-c) Effect of scaling up VLM size in VLC. (d) Effect of scaling up LLM size in Prism. Results are averaged over 5 random seeds, and error bars represent standard deviations.
  • Figure 5: Prompts designed for the end-to-end reasoning paradigm across different tasks. These prompts instruct the VLMs to generate the reasoning results.
  • ...and 9 more figures
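
The two-phase pipeline described in the Figure 1 and Figure 2 captions can be illustrated with a minimal Python sketch for the XOR task. This is not the authors' implementation: the response format, the function names (`extract_bits`, `xor_gate`, `xor_circuit`, `answer`), and the hand-written AND/OR/NOT circuit are all assumptions made for illustration, and the circuit is a plain-Python stand-in for the compiled SDD rather than a call into pySDD. The point it shows is that once the recognized concepts are turned into binary inputs, the rule is executed exactly and identically for three, five, or seven digits.

```python
import re
from functools import reduce


def extract_bits(response: str) -> list[int]:
    """Pull stand-alone 0/1 tokens out of a free-form VLM response.

    The response format is hypothetical; a real prompt would fix its own format.
    """
    return [int(tok) for tok in re.findall(r"\b[01]\b", response)]


def xor_gate(a: int, b: int) -> int:
    """XOR written as a small AND/OR/NOT circuit: (a AND NOT b) OR (NOT a AND b)."""
    return (a & (1 - b)) | ((1 - a) & b)


def xor_circuit(bits: list[int]) -> int:
    """Chain two-input XOR gates so the same rule covers any number of digits."""
    return reduce(xor_gate, bits, 0)


def answer(response: str) -> int:
    """Perception output (text) -> binary inputs -> exact symbolic evaluation."""
    bits = extract_bits(response)
    if not bits:
        raise ValueError("no binary digits found in the response")
    return xor_circuit(bits)


if __name__ == "__main__":
    # Hypothetical recognized-concept strings for images with 3 and 7 digits.
    print(answer("Digits: 0, 1, 0"))
    print(answer("Digits: 1, 1, 1, 0, 0, 0, 0"))
```

In this sketch, any error in the final answer can only come from the perception step (`extract_bits` receiving wrongly recognized digits), since the symbolic step is deterministic. That separation is the design choice the captions attribute to VLC.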