Synthesizing Visual Concepts as Vision-Language Programs

Antonia Wüst; Wolfgang Stammer; Hikaru Shindo; Lukas Helff; Devendra Singh Dhami; Kristian Kersting

TL;DR

This work introduces Vision-Language Programs (VLP), a neuro-symbolic framework that couples the perceptual strengths of Vision-Language Models (VLMs) with explicit symbolic reasoning via a domain-specific language (DSL) and probabilistic program synthesis. By grounding symbols from few-shot labels, mapping them through VLM functions, and composing them with symbolic operators in a Probabilistic Context-Free Grammar (PCFG), VLP induces executable visual rules that operate directly on images. The approach yields consistent improvements over both direct prompting and structured prompting across synthetic and real-world datasets, and it enables interpretable, debuggable reasoning in which a human in the loop can correct rules by editing the DSL. The findings show that VLP improves generalization and reasoning robustness without requiring domain-specific encoders, offering a scalable path toward transparent and dependable visual concept induction.

Abstract

Vision-Language Models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning, producing inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing interpretable logical rules, but they rely on rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the systematic reasoning of program synthesis. Rather than embedding reasoning inside the VLM, VLP leverages the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that enable easy shortcut mitigation. Experiments on synthetic and real-world datasets demonstrate that VLPs outperform direct and structured prompting, particularly on tasks requiring complex logical reasoning.
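
To make the pipeline described above more concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a toy setting in which each image is replaced by the set of symbols a VLM would report for it; `vlm_has`, the grammar, and its rule probabilities are all hypothetical stand-ins chosen for illustration. The sketch enumerates programs from a small PCFG and returns the most probable one that is consistent with the few-shot labels.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

# Assumption: an "image" is mocked as the set of symbols a VLM would report.
Image = FrozenSet[str]


def vlm_has(image: Image, symbol: str) -> bool:
    """Hypothetical stand-in for a VLM query like 'does the image show <symbol>?'."""
    return symbol in image


@dataclass(frozen=True)
class Program:
    text: str                      # human-readable rule
    prob: float                    # prior probability under the toy PCFG
    run: Callable[[Image], bool]   # executes the rule on one image


# Toy PCFG over Boolean rules (probabilities are illustrative assumptions):
#   B -> has(s) [0.5] | not B [0.1] | B and B [0.2] | B or B [0.2]
P_ATOM, P_NOT, P_AND, P_OR = 0.5, 0.1, 0.2, 0.2


def enumerate_programs(symbols: List[str], depth: int) -> List[Program]:
    """Enumerate all programs up to the given nesting depth with their priors."""
    atoms = [
        Program(f"has({s})", P_ATOM / len(symbols), lambda img, s=s: vlm_has(img, s))
        for s in symbols
    ]
    if depth == 0:
        return atoms
    subs = enumerate_programs(symbols, depth - 1)
    out = list(atoms)
    for p in subs:
        out.append(Program(f"not {p.text}", P_NOT * p.prob,
                           lambda img, p=p: not p.run(img)))
    for a, b in itertools.product(subs, repeat=2):
        out.append(Program(f"({a.text} and {b.text})", P_AND * a.prob * b.prob,
                           lambda img, a=a, b=b: a.run(img) and b.run(img)))
        out.append(Program(f"({a.text} or {b.text})", P_OR * a.prob * b.prob,
                           lambda img, a=a, b=b: a.run(img) or b.run(img)))
    return out


def synthesize(positives: List[Image], negatives: List[Image],
               depth: int = 1) -> Optional[Program]:
    """Return the most probable program that separates positives from negatives."""
    # Symbol grounding: collect candidate symbols from the few-shot examples.
    symbols = sorted(set().union(*positives, *negatives))
    best = None
    for prog in enumerate_programs(symbols, depth):
        consistent = (all(prog.run(i) for i in positives)
                      and not any(prog.run(i) for i in negatives))
        if consistent and (best is None or prog.prob > best.prob):
            best = prog
    return best


if __name__ == "__main__":
    pos = [frozenset({"circle", "red"}), frozenset({"circle", "blue"})]
    neg = [frozenset({"square", "red"}), frozenset({"triangle", "blue"})]
    rule = synthesize(pos, neg)
    print(rule.text)  # has(circle)
```

Because the induced rule is an explicit expression rather than a free-form model answer, it can be read, checked against new images, and edited by hand, which is the kind of interpretability the abstract refers to. A real system would replace `vlm_has` with actual VLM calls and use a guided search instead of exhaustive enumeration.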
