DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

Fucai Ke, Vijay Kumar B G, Xingjian Leng, Zhixi Cai, Zaid Khan, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi, Manmohan Chandraker

TL;DR

Visual reasoning (VR) remains challenging because tool interactions are unreliable and training data is scarce. DWIM addresses this by coupling discrepancy-aware training workflow generation with instruct-masking fine-tuning to cultivate tool-aware, multi-turn LLM agents. The approach yields state-of-the-art results across multiple VR benchmarks, with stronger generalization and higher data efficiency than prior methods, while reducing dependence on extensive prompting. Together, these components offer a scalable path toward practical, tool-aware visual reasoning in real-world settings.

Abstract

Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, such methods are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and the challenge of fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to clone only effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely used datasets.
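
As a rough illustration of what "cloning only effective actions" could mean at the loss level, the sketch below averages the negative log-likelihood over tokens flagged as effective and ignores the rest. This is an assumption about one plausible reading of instruct-masking, not the paper's implementation; the function `masked_nll`, the per-token mask, and the toy probabilities are all hypothetical.

```python
import math


def masked_nll(token_log_probs, effective_mask):
    """Average negative log-likelihood over tokens flagged as effective.

    Hypothetical helper: tokens whose mask entry is False contribute
    nothing, so the model is only pushed to imitate effective actions.
    """
    if len(token_log_probs) != len(effective_mask):
        raise ValueError("log-probs and mask must have the same length")
    kept = [-lp for lp, keep in zip(token_log_probs, effective_mask) if keep]
    if not kept:
        return 0.0  # nothing to imitate in this workflow
    return sum(kept) / len(kept)


# Toy workflow of three tokens; the middle one belongs to an ineffective action.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [True, False, True]
loss = masked_nll(log_probs, mask)
```

Here the low-probability middle token is excluded, so the loss reflects only the two tokens marked as effective.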

Paper Structure

This paper contains 23 sections, 6 equations, 10 figures, 11 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of the existing agent (Before) and our tool-aware agent (After). Both follow logically valid workflows with the same toolset, but our method improves tool selection and usage, minimizing tool-induced errors and ensuring more accurate, efficient execution.
  • Figure 2: Tools are not always reliable and may occasionally provide incorrect information. Consequently, workflows expected to yield correct answers may fail due to tool-related inaccuracies.
  • Figure 3: Overview of discrepancy-aware training workflow generation and the instruct-masking process.
  • Figure 4: Frozen LLM with in-context learning examples vs. 0-shot trained LLM performance on the OKVQA dataset.
  • Figure 5: Auto-Exploring Agentic Framework. The LLM agent generates ⟨Code⟩ for execution, ⟨Thought⟩ for reasoning, or ⟨Done⟩ to complete the task. It dynamically generates or refines actions while storing environmental information for incremental reasoning.
  • ...and 5 more figures
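
The Figure 5 caption describes a loop in which the agent emits code, a thought, or a completion signal while accumulating environment feedback. A minimal sketch of such a loop follows, assuming a scripted stand-in for the LLM and a toy executor; `AgentState`, `scripted_policy`, `toy_execute`, and `run_agent` are hypothetical names for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    question: str
    # Each entry is (kind, content, feedback); stored for later turns.
    observations: list = field(default_factory=list)


def scripted_policy(state):
    """Stand-in for the LLM: returns (kind, content) for the next turn."""
    script = [
        ("Code", "detect('dog')"),
        ("Thought", "two boxes were returned"),
        ("Done", "2"),
    ]
    return script[min(len(state.observations), len(script) - 1)]


def toy_execute(code):
    """Toy environment: echoes the code instead of running a real tool."""
    return f"ran: {code}"


def run_agent(question, policy, execute, max_turns=8):
    state = AgentState(question)
    for _ in range(max_turns):
        kind, content = policy(state)
        if kind == "Done":
            return content, state  # content is the final answer
        feedback = execute(content) if kind == "Code" else None
        state.observations.append((kind, content, feedback))
    return None, state  # turn budget exhausted without an answer


answer, final_state = run_agent("How many dogs?", scripted_policy, toy_execute)
```

Only code turns touch the environment; thoughts are recorded without feedback, and a turn cap keeps the loop from running indefinitely.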