Synthesizing Visual Concepts as Vision-Language Programs

Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting

TL;DR

This work introduces Vision-Language Programs (VLP), a neuro-symbolic framework that couples the perceptual strengths of Vision-Language Models with explicit symbolic reasoning via a domain-specific language and probabilistic program synthesis. By grounding symbols from few-shot labels, mapping them through VLM functions, and composing them with symbolic operators in a Probabilistic Context-Free Grammar, VLP induces executable visual rules that operate directly on images. The approach yields consistent improvements over direct prompting and structured prompting across synthetic and real-world datasets, and it enables interpretable, debuggable reasoning with a human-in-the-loop through DSL edits. The findings show VLP enhances generalization and reasoning robustness without requiring domain-specific encoders, offering a scalable path toward transparent and dependable visual concept induction.

Abstract

Vision-Language Models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning, producing inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, though they rely on rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.
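
To make the pipeline described above more concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes that each image has already been reduced to a set of object names (a hypothetical stand-in for VLM-based functions, which in the actual framework would query the image itself), uses an invented three-rule grammar with made-up prior weights, and selects the program with the highest accuracy on the labelled examples, breaking ties by prior weight. All function names (`exists_object`, `enumerate_programs`, `run`, `accuracy`, `synthesize`) are illustrative.

```python
import itertools

# Hypothetical stand-in for a VLM function: here an "image" is simply the set
# of object names found in it. A real system would query a VLM on pixels.
def exists_object(image, name):
    return name in image

# Toy weighted grammar (assumed, unnormalized prior weights):
#   exists(o): 0.6, not(o): 0.1, and(o1, o2): 0.3
def enumerate_programs(symbols):
    p_sym = 1.0 / len(symbols)
    for s in symbols:
        yield 0.6 * p_sym, ("exists", s)
        yield 0.1 * p_sym, ("not", s)
    for a, b in itertools.combinations(symbols, 2):
        yield 0.3 * p_sym * p_sym, ("and", a, b)

# Execute a program on one image and return a boolean classification.
def run(program, image):
    op = program[0]
    if op == "exists":
        return exists_object(image, program[1])
    if op == "not":
        return not exists_object(image, program[1])
    if op == "and":
        return exists_object(image, program[1]) and exists_object(image, program[2])
    raise ValueError(f"unknown operator: {op}")

# Fraction of labelled examples that the program classifies correctly.
def accuracy(program, positives, negatives):
    correct = sum(run(program, im) for im in positives)
    correct += sum(not run(program, im) for im in negatives)
    return correct / (len(positives) + len(negatives))

# Pick the most accurate program; among equally accurate ones, the one with
# the highest prior weight.
def synthesize(symbols, positives, negatives):
    weight, program = max(
        enumerate_programs(symbols),
        key=lambda wp: (accuracy(wp[1], positives, negatives), wp[0]),
    )
    return program

# Few-shot task: positives contain both candles and a cake.
positives = [{"candles", "cake"}, {"candles", "cake", "balloon"}]
negatives = [{"candles"}, {"cake", "balloon"}, {"balloon"}]

# Symbol discovery is approximated by collecting every object name seen.
symbols = sorted(set().union(*positives, *negatives))
best = synthesize(symbols, positives, negatives)
print(best)  # ('and', 'cake', 'candles')
```

In this toy task no single-object rule separates the two groups, so the search settles on the conjunction, which mirrors the "candles and birthday cake" composition used as the motivating example. Because the result is an explicit program, a reader can inspect or edit it directly, for instance by removing a symbol from `symbols` before synthesis.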

Paper Structure

This paper contains 51 sections, 15 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: VLMs cannot reliably perform inductive logic learning from images, failing to capture visual compositions like "candles and birthday cake". Vision-Language Programs (VLP) employ explicit symbolic reasoning to overcome such visual reasoning errors while maintaining perceptual flexibility.
  • Figure 2: Overview of Vision-Language Programs synthesis. Relevant variables are first discovered from the input examples (i) and used to construct a task-specific DSL, including VLM-based functions (ii). Program synthesis (iii) then searches this space to retrieve the most probable program that also achieves the highest accuracy on the input.
  • Figure 3: Qualitative comparison on Bongard-RWR. Direct VLM (Qwen3) prompting produces an incorrect rule about "abundance", misclassifying a query image. Qwen3 w/ VLP discovers a correct program that identifies round objects and achieves perfect query classification accuracy.
  • Figure 4: VLP performance improves as more input images are provided, in contrast to baselines, which stagnate or decline. Results are aggregated over models from Table 1.
  • Figure 5: VLP performance on CLEVR-Hans3 with DSL edits. For InternVL3, size-related VLM functions were added; for Qwen3, shortcut-related colors (red, gold) were removed.
  • ...and 11 more figures