PropTest: Automatic Property Testing for Improved Visual Programming
Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez
TL;DR
PropTest tackles failures in visual reasoning systems that rely on code generation by introducing automatic property test generation to constrain and validate LLM-produced solutions. By adopting a test-first paradigm, the framework yields tests that enforce data-type, syntactic, and semantic properties, guiding code generation and providing diagnostic insight when failures occur. Empirical results on GQA, A-OKVQA, RefCOCO, and RefCOCO+ show gains over baselines when using publicly available LLMs, for example +6.0% accuracy on GQA with Llama3-8B and +8.1% on RefCOCO+ with CodeLlama-34B relative to ViperGPT, along with improved interpretability through failure analyses. The findings highlight the practical value of test-driven generation in neuro-symbolic visual reasoning and point to future work on prompt design, tool integration, and self-refinement to further enhance robustness.
Abstract
Visual Programming has recently emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages Large Language Models (LLMs) to generate the source code for an executable computer program that solves a given problem. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an initial round of proposed solutions. Our method generates tests for data-type consistency, output syntax, and semantic properties. PropTest achieves results comparable to state-of-the-art methods while using publicly available LLMs. This is demonstrated across different benchmarks on visual question answering and referring expression comprehension. In particular, PropTest improves on ViperGPT by obtaining 46.1% accuracy (+6.0%) on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.
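To make the three kinds of checks concrete, the following is a minimal sketch of what a property test for a single visual question might look like. It is an illustration only, not the tests PropTest actually generates: the question ("What color is the car?"), the color vocabulary `COLORS`, and the helper names `property_test` and `passes` are all assumptions introduced here.

```python
from typing import Callable

# Hypothetical example question: "What color is the car?"
# Assumed vocabulary of acceptable answers (not taken from the paper).
COLORS = {"red", "blue", "green", "yellow", "black", "white", "gray", "brown"}


def property_test(answer: object) -> None:
    """Check one candidate answer against three illustrative properties."""
    # Data-type consistency: a VQA answer is expected to be a string.
    assert isinstance(answer, str), "answer must be a string"
    # Output syntax: the answer should be short (one or two words).
    assert 1 <= len(answer.split()) <= 2, "answer must be one or two words"
    # Semantic property: a color question should be answered with a color.
    assert answer.lower() in COLORS, "answer must be a color"


def passes(candidate: Callable[[], object]) -> bool:
    """Run a candidate program and report whether its output passes the test."""
    try:
        property_test(candidate())
        return True
    except AssertionError:
        return False
```

In this sketch a candidate program returning `"Red"` passes, while one returning an integer, a long phrase, or `"yes"` is rejected, and the assertion message indicates which property failed.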
