De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

Minghe Gao, Juncheng Li, Hao Fei, Liang Pang, Wei Ji, Guoming Wang, Zheqi Lv, Wenqiao Zhang, Siliang Tang, Yueting Zhuang

TL;DR

De-fine introduces a training-free framework that decomposes complex visual reasoning tasks into executable program blocks and iteratively refines them via multifaceted auto-feedback. By generating an abstract logical prompt, leveraging hierarchical task structure, and evolving a codebase through feedback-driven refinement, it achieves state-of-the-art zero-shot results across multiple vision-language tasks without task-specific training. The approach combines visual, textual, compile, and optional human feedback to guide program improvement, enabling robust handling of multi-step and cross-modal reasoning. Its model-agnostic design and codebase evolution paradigm offer practical benefits for scalable visual reasoning and open new avenues for agent-based research through modular feedback components.

Abstract

Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that need task-specific data, it performs visual processing and reasoning in an unsupervised manner. Current visual programming methods generate the program for each task in a single pass and lack the ability to evaluate and optimize it based on feedback, which consequently limits their effectiveness for complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach can improve logical reasoning performance by integrating the strengths of multiple models. Our experiments across various visual tasks show that De-fine creates more robust programs. Moreover, viewing each feedback module as an independent agent will yield fresh prospects for the field of agent research.

Paper Structure

This paper contains 15 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: De-fine decomposes the tasks into executable program blocks and automatically refines the program based on multifaceted feedback from the execution.
  • Figure 2: De-fine is a programming-based framework that can decompose tasks and refine the program. We summarize the process into four steps: (1) De-fine first constructs an abstract logical prompt. (2) We generate the program and execute it. (3) During execution, De-fine automatically generates multifaceted feedback for optimizing. (4) De-fine keeps the well-optimized code based on feedback and expands the codebase for future use. The pseudocode algorithm is shown in Appendix A.
  • Figure 3: The pipeline of abstract logical prompt generation. Initially, we generate substeps to address the given query. Subsequently, we retrieve the most relevant code based on the semantic relevance between code comments and substeps. Then, we mask irrelevant information in the retrieved code. The resulting abstract logical (AL) prompt is finally provided to the code generation model.
  • Figure 4: Refinement examples by the feedback of De-fine.
  • Figure 5: Analysis of (a) the number of iterative refinements and (b) abstract logical prompts.
  • ...and 1 more figure
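
The captions for Figures 2 and 3 describe a loop: retrieve relevant code, generate a program, collect feedback, refine, and keep the refined code for future use. The sketch below is one minimal way to picture that loop and is not the authors' implementation. All names (`retrieve`, `compile_feedback`, `refine`, `toy_generate`) are hypothetical. Word overlap stands in for semantic relevance, a syntax check stands in for the multifaceted (visual, textual, compile, human) feedback, and a toy function stands in for the code generation model.

```python
from typing import Callable, List, Optional, Tuple

# Assumed codebase layout for this sketch: (comment, code) pairs.
Codebase = List[Tuple[str, str]]


def retrieve(substep: str, codebase: Codebase) -> Optional[str]:
    """Return the code whose comment shares the most words with the substep.

    Jaccard word overlap is a simple stand-in for semantic relevance.
    """
    words = set(substep.lower().split())
    best_code, best_score = None, 0.0
    for comment, code in codebase:
        other = set(comment.lower().split())
        union = words | other
        score = len(words & other) / len(union) if union else 0.0
        if score > best_score:
            best_code, best_score = code, score
    return best_code


def compile_feedback(program: str) -> List[str]:
    """Return compile-error messages; an empty list means the program parses."""
    try:
        compile(program, "<generated>", "exec")
        return []
    except SyntaxError as err:
        return [f"SyntaxError: {err.msg}"]


def refine(
    query: str,
    generate: Callable[[str, Optional[str], List[str]], str],
    codebase: Codebase,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a program and regenerate it until feedback is clean."""
    hint = retrieve(query, codebase)         # retrieved code for the prompt
    notes: List[str] = []
    for _ in range(max_rounds):
        program = generate(query, hint, notes)
        notes = compile_feedback(program)    # feedback from this round
        if not notes:
            codebase.append((query, program))  # expand the codebase
            return program
    return None


def toy_generate(query: str, hint: Optional[str], notes: List[str]) -> str:
    # Hypothetical stand-in for the code generation model: emits a broken
    # program first and a fixed one once it has received feedback.
    return "answer = 1 + 1" if notes else "answer = 1 +"


codebase: Codebase = [("count the objects in an image", "n = len(objects)")]
program = refine("add two numbers", toy_generate, codebase)
print(program)
```

In this toy run the first generated program fails the syntax check, the error message is passed back to the generator, and the second attempt is accepted and appended to the codebase. A fuller version would execute the program and add the other feedback types named in the TL;DR.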