ViUniT: Visual Unit Tests for More Robust Visual Programming

Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles

TL;DR

ViUniT introduces Visual Unit Testing, which verifies the logical correctness of visual programs by automatically generating image-based unit tests. Its unsupervised pipeline creates test descriptions with LLMs, converts them into images via diffusion models, and scores candidate programs on unit-test performance to select the best one. Across VQA and ITM tasks the approach improves model performance by 11.4%, enables open-source 7B models to outperform gpt-4o-mini by an average of 7.7%, and reduces the occurrence of programs that are correct for the wrong reasons by 40%. Unit-test signals can also drive answer refusal, re-prompting, and reinforcement learning rewards, making visual reasoning systems more robust and reliable.

Abstract

Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures. We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers, and image synthesis to produce the corresponding images. We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.

Paper Structure

This paper contains 56 sections, 10 equations, 20 figures, 14 tables, 1 algorithm.

Figures (20)

  • Figure 1: Framework Overview. Given a query $q$ about an image, the unit test generator $\psi$ generates a set $\mathcal{T}_{\text{cand}} = \psi(q, p)$ of $M$ candidate pairs $t_i = (c_i, y_i)$, each consisting of an image caption $c_i$ and an expected answer $y_i$ (Section \ref{sec:candidates}). The coverage sampler $\sigma$ then subsamples $K$ pairs from $\mathcal{T}_{\text{cand}}$, forming the subset $\mathcal{T}_K$ (Section \ref{sec:sampling}). These captions are passed to an image generator $M$ to create the corresponding images $v_i = M(c_i)$ for each unit test (Section \ref{sec:image}). Each candidate program is subsequently executed and assigned a score $S(p)$ by the scorer $H$ based on its performance on the unit tests (Section \ref{sec:unit_test_scoring}). Finally, the highest-scoring program is selected.
  • Figure 2: Visual Unit Testing Utilization Strategies (Section \ref{sec:method_applicatons}).
  • Figure 3: Unit Test Examples generated by
  • Figure 4: Comparison of Unit Tests Generated by Different Methods
  • Figure 5: Accuracy across varying unit test and program counts.
  • ...and 15 more figures
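
The selection loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: the `render` function replaces the image generator with an identity function on captions, the candidate programs are toy string checks, the coverage sampler is omitted, and the score $S(p)$ is assumed to be a plain pass rate (the paper's scorer $H$ may differ).

```python
from typing import Callable, List, Tuple

# A unit test t_i = (c_i, y_i): an image caption and the expected answer.
UnitTest = Tuple[str, str]
# A candidate program maps an "image" to an answer (hypothetical interface).
Program = Callable[[str], str]


def score_program(program: Program, tests: List[UnitTest],
                  render: Callable[[str], str]) -> float:
    """Assumed score S(p): fraction of unit tests the program passes."""
    if not tests:
        return 0.0
    passed = 0
    for caption, expected in tests:
        image = render(caption)  # v_i = M(c_i)
        try:
            answer = program(image)
        except Exception:
            continue  # a crashing program simply fails this test
        if answer == expected:
            passed += 1
    return passed / len(tests)


def select_best_program(programs: List[Program], tests: List[UnitTest],
                        render: Callable[[str], str]) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring program and all scores."""
    scores = [score_program(p, tests, render) for p in programs]
    best = max(range(len(programs)), key=scores.__getitem__)
    return best, scores


# Toy setup for the query "Are there two dogs?".
def render(caption: str) -> str:
    return caption  # stand-in: the caption itself plays the role of the image


def program_a(image: str) -> str:
    # Ignores the count, so it can be right for the wrong reason.
    return "yes" if "dog" in image else "no"


def program_b(image: str) -> str:
    # Checks both the object and the count.
    return "yes" if "two dogs" in image else "no"


tests = [
    ("two dogs on a couch", "yes"),
    ("one dog on a couch", "no"),
    ("two cats on a couch", "no"),
]

best, scores = select_best_program([program_a, program_b], tests, render)
print(best, scores)
```

In this toy run, `program_a` answers the first test correctly but fails the single-dog test, which is the kind of right-for-the-wrong-reasons behavior the unit tests are meant to expose, so `program_b` is selected.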