VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang

TL;DR

VDebugger is a novel critic-refiner framework trained to localize and debug visual programs by tracking their execution step by step. It identifies and corrects program errors using detailed execution feedback, improving both interpretability and accuracy.

Abstract

Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at https://github.com/shirley-wu/vdebugger/
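The critic-refiner loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `trace_execution`, `debug`, `toy_critic`, and `toy_refiner` are hypothetical names, the program is simplified to top-level statements that assign a variable called `answer`, and the rule-based critic and refiner stand in for the trained models.

```python
import ast

def trace_execution(source, env):
    """Run top-level statements one by one and record variables after each step."""
    scope = dict(env)
    trace = []
    for node in ast.parse(source).body:
        exec(compile(ast.Module([node], []), "<program>", "exec"), scope)
        # Freeze values with repr so later steps cannot mutate earlier snapshots.
        state = {k: repr(v) for k, v in scope.items()
                 if k not in env and k != "__builtins__"}
        trace.append((ast.unparse(node), state))
    return trace, scope.get("answer")

def debug(source, env, critic, refiner, max_iters=3):
    """Alternate critic and refiner until the critic accepts the program."""
    for _ in range(max_iters):
        trace, answer = trace_execution(source, env)
        bad_step = critic(trace, answer)  # index of the suspect statement, or None
        if bad_step is None:
            break
        source = refiner(source, bad_step, trace)
    return source, trace_execution(source, env)[1]

def toy_critic(trace, answer):
    # Hypothetical rule standing in for a trained critic:
    # the answer should equal the number of detected boxes.
    n_boxes = len(ast.literal_eval(trace[0][1]["boxes"]))
    return None if answer == n_boxes else len(trace) - 1

def toy_refiner(source, bad_step, trace):
    # Hypothetical stand-in for a trained refiner: rewrite the flagged line.
    # Assumes one statement per source line.
    lines = source.splitlines()
    lines[bad_step] = "answer = len(boxes)"
    return "\n".join(lines)

env = {"find": lambda name: ["box_a", "box_b"]}  # stub for a detector-backed API
buggy = 'boxes = find("dog")\nanswer = len(boxes) + 1'
fixed, answer = debug(buggy, env, toy_critic, toy_refiner)
print(fixed)
print(answer)
```

The point of the sketch is the data flow: the critic sees the per-step execution trace rather than only the final answer, and the refiner receives the localized step together with that trace.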

Paper Structure

This paper contains 14 sections, 6 equations, 11 figures, 8 tables, and 2 algorithms.

Figures (11)

  • Figure 1: Overview of visual programming and VDebugger. Above: the visual program invokes APIs to answer the input question. Each involved API (e.g. find) is implemented with a specialized foundation VLM (e.g. an object detection model). Below: VDebugger debugs the visual program by inspecting the execution process. In this example, the colors variable represents the colors of all skiers' jackets and contains two values, but the return value "yes" suggests that all skiers wear jackets of the same color. Catching this discrepancy, the critic identifies that the last line of the program is incorrect, and the refiner rewrites that line into the correct code.
  • Figure 2: Training data collection pipeline. Given an existing dataset of question-answer pairs, we prompt an LLM to generate correct programs, inject errors to generate incorrect programs, and use the paired data for supervised fine-tuning (SFT).
  • Figure 3: Categorization of synthetic errors generated by greedy decoding and mask-best decoding respectively.
  • Figure 4: Performance on GQA, NLVRv2 and RefCOCOg datasets by the number of debugging iterations.
  • Figure 5: Sources of errors on GQA, NLVRv2 and RefCOCOg datasets. We categorize the predictions into four categories: correct, multiple correct answers (where the prediction is correct but does not match the ground truth annotation), foundation VLM errors, and program errors.
  • ...and 6 more figures
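
The discrepancy described in the Figure 1 caption can be made concrete with a small example. The code below is an assumed reconstruction for illustration only: the caption does not show the program text, so the exact faulty line, the `jacket_color` helper, and the stubbed `find` API are all hypothetical. It only reproduces the situation in which `colors` holds two values while the program still returns "yes".

```python
def find(image, name):
    """Stub for a detector-backed `find` API: filters fake patches by label."""
    return [patch for patch in image if patch["label"] == name]

def jacket_color(patch):
    """Stub for an attribute query on a single patch (hypothetical helper)."""
    return patch["jacket"]

def buggy_program(image):
    skiers = find(image, "skier")
    colors = {jacket_color(s) for s in skiers}
    # Assumed bug: this is true whenever at least one skier is found.
    return "yes" if len(colors) > 0 else "no"

def fixed_program(image):
    skiers = find(image, "skier")
    colors = {jacket_color(s) for s in skiers}
    # Refined last line: all jackets match only if exactly one color is present.
    return "yes" if len(colors) == 1 else "no"

# Two skiers wearing different jackets, as in the caption's example.
image = [
    {"label": "skier", "jacket": "red"},
    {"label": "skier", "jacket": "blue"},
]

print(buggy_program(image))
print(fixed_program(image))
```

Only the last line differs between the two programs, which matches the caption's account of the critic localizing the error to that line and the refiner rewriting it.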