Visual Set Program Synthesizer

Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun

Abstract

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.

Paper Structure (16 sections, 7 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: An example of Visual Set Programmer with CASTER in action, compared with an end-to-end multi-modal LLM.
  • Figure 2: Visual Set Program Synthesizer is a framework for solving complex visual queries by explicitly introducing a structured machine-readable program language. The MLLM generates a program, which is then run by the Program Execution Engine. This engine relies on a perception stack (e.g., object detection, OCR) and a knowledge base to ground the program's set operations (e.g., FILTER, SELECT) in the visual scene and retrieve necessary attributes.
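The generate-then-execute flow described in Figure 2 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the paper's implementation: the program is written as a list of tuples, the `SceneObject` record stands in for the output of the perception stack and knowledge base, the `ARGMIN` operation is a hypothetical addition alongside the FILTER and SELECT operations named in the caption, and the products and sugar values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical grounded object; a real system would fill these
    fields from object detection, OCR, and a knowledge base."""
    name: str
    category: str
    sugar_g: float


# Made-up scene standing in for a supermarket shelf.
SCENE = [
    SceneObject("Cola Classic", "soda", 39.0),
    SceneObject("Lemon Fizz", "soda", 26.0),
    SceneObject("Orange Juice", "juice", 22.0),
    SceneObject("Diet Cola", "soda", 0.0),
]


def run_program(program, scene):
    """Execute a list of (operation, *arguments) steps over a set of objects."""
    result = list(scene)
    for op, *args in program:
        if op == "FILTER":
            # Keep only objects whose attribute equals the given value.
            attr, value = args
            result = [obj for obj in result if getattr(obj, attr) == value]
        elif op == "ARGMIN":
            # Keep the single object with the smallest attribute value.
            (attr,) = args
            result = [min(result, key=lambda obj: getattr(obj, attr))] if result else []
        elif op == "SELECT":
            # Project each remaining object onto one attribute.
            (attr,) = args
            result = [getattr(obj, attr) for obj in result]
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


# "Which soda has the least sugar?" expressed as a symbolic program.
PROGRAM = [
    ("FILTER", "category", "soda"),
    ("ARGMIN", "sugar_g"),
    ("SELECT", "name"),
]

print(run_program(PROGRAM, SCENE))  # prints ['Diet Cola']
```

In this sketch the program, not the model, carries the comparison logic, so each intermediate set can be inspected step by step; this is the kind of transparency the abstract attributes to program-driven reasoning.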