ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Shaofeng Yin, Ting Lei, Yang Liu

TL;DR

7B LFMs fine-tuned on ToolVQA not only perform strongly on the authors' test set but also surpass the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.

Abstract

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The 7B LFMs fine-tuned on ToolVQA not only achieve impressive performance on our test set but also surpass the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.

Paper Structure

This paper contains 30 sections, 3 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Our real-world setting includes (1) complex visual scenarios with real-world context; (2) challenging queries with an implicit multi-step reasoning process. Existing datasets (left) do not meet these requirements, while our ToolVQA (right) does.
  • Figure 2: Image sources and corresponding tools of ToolVQA. We filter out overly simplistic tables and images from the data sources.
  • Figure 3: The pipeline of ToolEngine, which contains three core components: Real-world Example Construction, Image-guided DFS on Tool Graph, and LCS-based Example Matching. Given an input image, we perform DFS on the complete tool graph. At each step, an LFM controller generates the next tool’s name and arguments, guided by the image, the current tool-use trajectory, and matched examples. Once DFS is complete, the tool-use trajectory is determined and is then used to generate the query and answer.
  • Figure 4: Comparison between different matching strategies. When a fixed example limits the diversity of generation, LCS matching can integrate multiple examples to enhance generation.
  • Figure 5: Tool frequency.
  • ...and 5 more figures
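
The LCS-based example matching named in the Figure 3 and Figure 4 captions can be illustrated with a minimal sketch. It assumes that "LCS" means longest common subsequence over sequences of tool names, and that the examples whose trajectories overlap most with the current partial trajectory are the ones chosen as in-context examples. The function names (`lcs_length`, `match_examples`), the parameter `k`, and the tool names in the usage note are hypothetical; the scoring and example integration used by the actual ToolEngine may differ.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tool-name sequences."""
    # prev[j] is the LCS length between the prefix of `a` processed so far
    # and the first j elements of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_examples(
    trajectory: Sequence[str],
    examples: List[List[str]],
    k: int = 2,
) -> List[List[str]]:
    """Return up to k example trajectories ranked by LCS overlap.

    Assumption: ranking by raw LCS length and dropping zero-overlap
    examples is an illustrative choice, not the paper's stated rule.
    """
    scored = [(lcs_length(trajectory, ex), i) for i, ex in enumerate(examples)]
    # Highest overlap first; ties keep the original example order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [examples[i] for score, i in scored[:k] if score > 0]
```

With the hypothetical partial trajectory `["ocr", "calculator"]` and the example pool `[["caption", "web_search"], ["ocr", "web_search", "calculator"], ["ocr", "plot"]]`, `match_examples` ranks the second example first (overlap 2), the third next (overlap 1), and drops the first (overlap 0). Returning several examples rather than a single fixed one reflects the point of the Figure 4 caption that matching can draw on multiple examples.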