Visual Set Program Synthesizer

Zehua Cheng; Wei Dai; Wenhu Zhang; Thomas Lukasiewicz; Jiahao Sun

Abstract

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition but also explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.

Paper Structure (16 sections, 7 equations, 2 figures, 5 tables)

Introduction
Related Work
Methodologies
Problem Formulation
Compositional Actor-Set Theoretic Reward (CASTER)
Experiments
Experimental Setup
Evaluation
Main Results
Compositional Generalization
Ablation Studies
Error Analysis
Conclusion
Program Language and Execution Engine Definition
Experimental Setup and Hyperparameters
...and 1 more sections

Figures (2)

Figure 1: An example of the Visual Set Programmer with CASTER in action, compared with an end-to-end multimodal LLM.
Figure 2: Visual Set Program Synthesizer is a framework for solving complex visual queries by explicitly introducing a structured machine-readable program language. The MLLM generates a program, which is then run by the Program Execution Engine. This engine relies on a perception stack (e.g., object detection, OCR) and a knowledge base to ground the program's set operations (e.g., FILTER, SELECT) in the visual scene and retrieve necessary attributes.
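
The generate-then-execute pipeline described in the Figure 2 caption can be illustrated with a toy sketch. This is not the paper's program language or engine: only the operation names FILTER and SELECT come from the caption. The `Item` record, the `SELECT_MIN` variant, the hand-written `run_program`, and the scene values are hypothetical stand-ins for what a perception stack and knowledge base might supply for the "Which soda has the least sugar?" query.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical detected object with attributes from OCR or a knowledge base."""
    name: str
    category: str
    sugar_g: float  # invented illustrative value, not data from the paper


def FILTER(items: List[Item], predicate: Callable[[Item], bool]) -> List[Item]:
    """Set operation: keep only the items that satisfy the predicate."""
    return [item for item in items if predicate(item)]


def SELECT_MIN(items: List[Item], key: Callable[[Item], float]) -> Optional[Item]:
    """Hypothetical SELECT variant: return the item with the smallest key, or None."""
    return min(items, key=key) if items else None


def run_program(scene: List[Item]) -> Optional[Item]:
    """Hand-written stand-in for a program an MLLM might emit for the soda query."""
    sodas = FILTER(scene, lambda item: item.category == "soda")
    return SELECT_MIN(sodas, key=lambda item: item.sugar_g)


# Toy scene standing in for the output of object detection and OCR.
scene = [
    Item("Cola A", "soda", 39.0),
    Item("Lemon B", "soda", 26.0),
    Item("Juice C", "juice", 22.0),
]

answer = run_program(scene)
print(answer.name if answer else "no soda found")
```

Because each step is an explicit set operation, the intermediate set (here, the filtered sodas) can be inspected, which is the kind of transparency the abstract attributes to program-driven reasoning.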
