PropTest: Automatic Property Testing for Improved Visual Programming

Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez

TL;DR

PropTest tackles failures in visual reasoning systems that rely on code generation by introducing automatic property test generation to constrain and validate LLM-produced solutions. By adopting a test-first paradigm, the framework yields tests that enforce data-type, syntactic, and semantic properties, guiding code generation and providing diagnostic insight when failures occur. Empirical results across GQA, A-OKVQA, RefCOCO, and RefCOCO+ show consistent accuracy and IoU gains over baselines using public LLMs, for example 46.1% accuracy on GQA with Llama3-8B (+6.0% over ViperGPT) and 59.5% on RefCOCO+ with CodeLlama-34B (+8.1%), along with improved interpretability via failure analyses. The findings highlight the practical value of test-driven generation in neuro-symbolic visual reasoning and point to future work on prompt design, tool integration, and self-refinement to further enhance robustness.

Abstract

Visual Programming has recently emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages Large Language Models (LLMs) to generate the source code for an executable computer program that solves a given problem. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an initial round of proposed solutions. Our method generates tests for data-type consistency, output syntax, and semantic properties. PropTest achieves comparable results to state-of-the-art methods while using publicly available LLMs. This is demonstrated across different benchmarks on visual question answering and referring expression comprehension. In particular, PropTest improves ViperGPT by obtaining 46.1% accuracy (+6.0%) on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.
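
To make the three kinds of properties named in the abstract concrete, here is a minimal sketch of a check for a hypothetical question such as "What color is the car?". The function name `violated_properties` and the fixed `KNOWN_COLORS` vocabulary are assumptions made for illustration and are not taken from the paper. In particular, the vocabulary lookup only stands in for a semantic check, which a real system would more plausibly delegate to a model.

```python
# Hypothetical stand-in for a model-based semantic check; a real system
# would more likely query a language or vision model here.
KNOWN_COLORS = {"red", "green", "blue", "white", "black", "yellow"}


def violated_properties(result):
    """Return the properties that `result` violates (empty list if none)."""
    # Data-type consistency: the answer should be a string.
    if not isinstance(result, str):
        return ["data type: expected str"]
    failures = []
    # Output syntax: the answer should be one or two words.
    if len(result.split()) not in (1, 2):
        failures.append("syntax: expected one or two words")
    # Semantic property: the answer should name a color.
    if result.strip().lower() not in KNOWN_COLORS:
        failures.append("semantic: expected a color")
    return failures


print(violated_properties("red"))              # no violations
print(violated_properties("a very dark red"))  # syntax and semantic violations
print(violated_properties(42))                 # data-type violation
```

Returning a list of violated properties, rather than raising on the first failure, is a design choice of this sketch: it shows which property failed, in the spirit of the diagnostic use of tests described above.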

Paper Structure

This paper contains 22 sections, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Visual programming methods generate code for a program to solve a vision-and-language task such as VQA. PropTest improves on these methods by automatically generating testing code that probes for several output properties. This is used as additional information when generating code and checking the correctness of the output solutions. As a baseline we use ViperGPT under CodeLlama-7B for this example.
  • Figure 2: An overview of PropTest. Given an image and a question, the goal is to generate Python code that can be executed to get an answer. PropTest first calls an LLM to generate test cases based on the inferred properties of the answer. Then, the generated test cases are used to improve the quality of Python code.
  • Figure 3: Three examples of property test cases generated for visual question answering and for visual grounding. Here, execute_command() is the generic name of the generated program routine, and result is the output of executing it.
  • Figure 4: Comparison of our method against visual programming methods with different LLMs across two tasks and four benchmarks. We report Accuracy on two visual question answering benchmarks, and IoU on two visual grounding benchmarks. GPT-4o* results are evaluated on only 500 subsamples.
  • Figure 5: Example results on GQA, A-OKVQA and RefCOCO. We show cases where PropTest succeeds but the baseline ViperGPT fails. Input questions and answers are shown on the left, generated property test cases in the middle, and code on the right.
  • ...and 12 more figures
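
The flow described in the Figure 2 caption (generate tests first, then use them when generating code) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `run_with_property_tests`, `stub_generate_tests`, and `stub_generate_code` are hypothetical names, the two stubs return fixed strings in place of real LLM calls, the image is represented as a plain dictionary, and `execute_test` is an assumed name for the test routine. Only `execute_command()` and `result` follow the naming mentioned in the Figure 3 caption.

```python
def run_with_property_tests(question, image, generate_tests, generate_code):
    """Generate tests first, then code, then run the code under the tests."""
    test_src = generate_tests(question)           # step 1: tests from the question
    code_src = generate_code(question, test_src)  # step 2: code conditioned on tests
    namespace = {}
    # exec keeps the sketch short; it is unsafe for untrusted generated code.
    exec(code_src, namespace)  # defines execute_command(image)
    exec(test_src, namespace)  # defines execute_test(image)
    try:
        return namespace["execute_test"](image), True
    except AssertionError:
        # What to do after a failed test is left to the caller in this sketch.
        return None, False


def stub_generate_tests(question):
    # Hypothetical stand-in for an LLM call that writes a property test.
    return (
        "def execute_test(image):\n"
        "    result = execute_command(image)\n"
        "    assert isinstance(result, str)\n"
        "    assert result in ('yes', 'no')\n"
        "    return result\n"
    )


def stub_generate_code(question, test_src):
    # Hypothetical stand-in for an LLM call that writes the solution program.
    return (
        "def execute_command(image):\n"
        "    return 'yes' if image.get('has_dog') else 'no'\n"
    )


answer, passed = run_with_property_tests(
    "Is there a dog?", {"has_dog": True}, stub_generate_tests, stub_generate_code
)
print(answer, passed)
```

Both generated snippets are executed into the same namespace so that the test routine can call `execute_command()` directly.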