Recursive Visual Programming

Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

TL;DR

Recursive Visual Programming (RVP) advances Visual Question Answering by replacing monolithic code generation with iterative recursive decomposition and dynamic return-type assignment. Built on a two-part architecture with a code generator and a visual executor, and enabled by a recursive_query mechanism, RVP breaks complex questions into simpler sub-questions, solves them modularly, and aggregates results. Empirical results across GQA, VSR, COVR, and NextQA demonstrate improved accuracy and enhanced interpretability, supported by ablations on dynamic typing, prompts, and error-feedback loops. The work also analyzes open-source versus proprietary models, prompting strategies, and readability impacts, suggesting broad applicability of recursive, modular coding in visual reasoning and beyond.
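
The decomposition loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation. The code generator is replaced by a hypothetical `generate_code` function with canned outputs (a real system would prompt an LLM), the "image" is a hand-written dictionary named `SCENE`, and the `execute_command` entry point is an assumed convention. Only the name `recursive_query` comes from the summary.

```python
from typing import Any, Dict

# Hypothetical toy "image": a real executor would run vision models on pixels.
SCENE: Dict[str, Any] = {
    "mirrors": [{"id": "small", "area": 2}, {"id": "large", "area": 9}],
    "people": [{"near": "large", "hat": True}, {"near": "small", "hat": False}],
}


def generate_code(question: str) -> str:
    """Hypothetical stand-in for the code generator (canned, no LLM call)."""
    if question == "Is the person near the largest mirror wearing a hat?":
        # Complex question: the generated code delegates to two sub-questions.
        return (
            "def execute_command(image, recursive_query) -> str:\n"
            "    mirror_id = recursive_query(image, 'Which mirror is the largest?')\n"
            "    has_hat = recursive_query(\n"
            "        image, 'Is the person near mirror ' + mirror_id + ' wearing a hat?')\n"
            "    return 'yes' if has_hat else 'no'\n"
        )
    if question == "Which mirror is the largest?":
        # Simple question: answered directly, returning a str.
        return (
            "def execute_command(image, recursive_query) -> str:\n"
            "    return max(image['mirrors'], key=lambda m: m['area'])['id']\n"
        )
    if question.startswith("Is the person near mirror "):
        mirror_id = question.split()[5]
        # Simple question: answered directly, returning a bool.
        return (
            "def execute_command(image, recursive_query) -> bool:\n"
            "    return any(p['hat'] for p in image['people']\n"
            f"               if p['near'] == {mirror_id!r})\n"
        )
    raise ValueError(f"no canned program for question: {question!r}")


def recursive_query(image: Any, question: str, depth: int = 0, max_depth: int = 5) -> Any:
    """Generate code for one question, run it, and recurse on sub-questions."""
    if depth > max_depth:
        raise RecursionError("sub-question nesting too deep")
    namespace: Dict[str, Any] = {}
    # Assumption: generated code is trusted here; a real system should sandbox it.
    exec(generate_code(question), namespace)

    def child(img: Any, sub_question: str) -> Any:
        return recursive_query(img, sub_question, depth + 1, max_depth)

    return namespace["execute_command"](image, child)


answer = recursive_query(SCENE, "Is the person near the largest mirror wearing a hat?")
print(answer)
```

Each sub-question gets its own small program with its own return type (a `str` for the mirror, a `bool` for the hat), and the top-level program only aggregates the results. The depth limit is an added safeguard in this sketch, not something the summary describes.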

Abstract

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP approaches VQA tasks with an iterative, recursive code generation approach, allowing decomposition of complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment, i.e., as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to generate that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.

Paper Structure

This paper contains 37 sections, 3 equations, 17 figures, and 15 tables.

Figures (17)

  • Figure 1: A breakdown of Recursive Visual Programming for a visual question, illustrated with an image whose scenes contain smaller versions of themselves. To locate a man wearing a hat at the third level of the image, the model first generates code that makes two recursive_query API calls. The initial call identifies the largest mirror (representing the next image level), and the subsequent call seeks "the man with a hat" at the second level. The model iteratively generates and executes code through these calls until the final answer is produced without additional recursive calls.
  • Figure 1: Examples of CodeLlama. We provide examples of code generated by CodeLlama. While CodeLlama demonstrates impressive reasoning abilities (below), it also makes certain errors, such as defining unused variables (above). Interestingly, CodeLlama learns to include logical comments within the code.
  • Figure 2: Motivating Examples. Adding the prompt "solve the problem recursively" to the input significantly improves the model's performance on traditional coding tasks: Dyck language (Suzgun et al., 2022) and Game of 24 (Yao et al., 2023). "NR" refers to the non-recursive approach and "R" to the recursive approach. This suggests that LLMs can benefit from the recursive programming approach that humans use.
  • Figure 2: CodeLlama Type Confusion Example. This figure illustrates an instance where CodeLlama confuses types despite having the return type explicitly specified. Even though the recursive_query call explicitly states "Return a bool" and the generated function begins with the signature '-> bool', the function returns a value of the incorrect type str.
  • Figure 3: Example from COVR (Upper). Current VP methods fail to provide binary 'yes' or 'no' answers, while the Recursive Visual Programming method outputs the correct answer. Examples from GQA (Middle and Bottom). RVP outperforms current non-recursive methods by correctly addressing all details and their associated logic.
  • ...and 12 more figures