VQA Training Sets are Self-play Environments for Generating Few-shot Pools

Tautvydas Misiunas, Hassan Mansoor, Jasper Uijlings, Oriana Riva, Victor Carbune

TL;DR

The paper addresses the costly data construction challenge in visual question answering by transforming existing training sets into self-play environments where a multimodal model (Gemini) learns to use itself or an auxiliary tool (ScreenAI) to decompose and solve complex visual reasoning tasks. It bootstraps the process with zero-shot prompts and iteratively refines them into few-shot pools, using the training task metric as the reward to filter and propagate successful exemplars. Across ChartQA, PlotQA v2, InfographicVQA, and DocVQA, the approach yields substantial gains over zero-shot baselines, with two training steps producing meaningful improvements and mixed-shot pools offering further benefits through aggregation strategies like VLM-Judge. The method demonstrates strong generalization with limited data, highlighting a path toward reducing dataset construction costs while enhancing compositional reasoning in vision-language models, particularly for charts, infographics, and documents.

Abstract

Large language models and large vision models are increasingly capable of solving compositional reasoning tasks, as measured by breakthroughs in visual question answering benchmarks. However, state-of-the-art solutions often involve careful construction of large pre-training and fine-tuning datasets, which can be expensive. The use of external tools, whether other ML models, search engines, or APIs, can significantly improve performance by breaking down high-level reasoning questions into sub-questions that are answerable by individual tools, but this approach incurs similar dataset construction costs, because fine-tuned models must be taught how to use the available tools. We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards. This enables a model to autonomously teach itself to use itself or another model as a tool. By doing so, we augment training sets by integrating external signals. The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set. Our experiments showcase how Gemini learns to use itself, or another smaller and specialized model such as ScreenAI, to iteratively improve performance on training sets. Our approach successfully generalizes and improves upon zero-shot performance on chart, infographic, and document visual question answering datasets.
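The refinement loop described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: `build_few_shot_pool`, `toy_solve`, `exact_match`, the threshold, and the choice to rebuild the pool from each round's correctly solved examples are all assumptions made for the example, and the toy solver merely stands in for a real model call such as Gemini.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    question: str
    answer: str


# A few-shot exemplar: (question, solution text that scored well).
Shot = Tuple[str, str]
# A solver maps (question, few-shot exemplars) to (solution text, prediction).
Solver = Callable[[str, Sequence[Shot]], Tuple[str, str]]


def exact_match(prediction: str, target: str) -> float:
    """Hypothetical task metric used as the reward: 1.0 on exact match."""
    return float(prediction.strip() == target.strip())


def build_few_shot_pool(
    train_set: Sequence[Example],
    solve: Solver,
    metric: Callable[[str, str], float],
    steps: int = 2,
    shots_per_prompt: int = 2,
    threshold: float = 1.0,
    seed: int = 0,
) -> List[Shot]:
    """Start zero-shot, then keep only solutions the metric rewards."""
    rng = random.Random(seed)
    pool: List[Shot] = []  # empty pool means the first round is zero-shot
    for _ in range(steps):
        new_pool: List[Shot] = []
        for ex in train_set:
            shots = rng.sample(pool, min(shots_per_prompt, len(pool)))
            solution, prediction = solve(ex.question, shots)
            if metric(prediction, ex.answer) >= threshold:
                new_pool.append((ex.question, solution))
        # Assumption: the next round draws exemplars from this round's
        # successes; if nothing was solved, the previous pool is kept.
        pool = new_pool or pool
    return pool


def toy_solve(question: str, shots: Sequence[Shot]) -> Tuple[str, str]:
    """Toy stand-in for a model: handles '+' zero-shot, '*' only with shots."""
    a, op, b = question.split()
    if op == "+":
        prediction = str(int(a) + int(b))
    elif op == "*" and shots:
        prediction = str(int(a) * int(b))
    else:
        prediction = "unknown"
    return f"answer = {a} {op} {b}", prediction


TRAIN = [Example("2 + 3", "5"), Example("4 * 5", "20"), Example("1 + 1", "2")]
pool_after_one_step = build_few_shot_pool(TRAIN, toy_solve, exact_match, steps=1)
pool_after_two_steps = build_few_shot_pool(TRAIN, toy_solve, exact_match, steps=2)
print(len(pool_after_one_step), len(pool_after_two_steps))
```

In this toy setting the second step solves an example that the zero-shot round missed, which mirrors the idea that exemplars filtered by the training metric can lift performance in later rounds. A real setup would replace `toy_solve` with a prompted model call and `exact_match` with the dataset's own metric.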

Paper Structure (32 sections, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Training sets are transformed into self-play environments where Gemini learns how to use itself, or another model such as ScreenAI, as a tool. Each training step uses the VQA task metric as the reward, filtering correctly solved examples and using them as few-shot examples in the next round. We seed the environments with zero-shot prompts using three different code generation approaches.
  • Figure 2: Example of a compositional reasoning question from ChartQA (Masry et al., 2022). Gemini predicts code conditioned on the image, re-using itself through an API for visual information lookup (image_obj.answer) and leveraging the computational environment for the arithmetic comparison (i.e., comparing bar values).
  • Figure 3: Visual program-of-thought (Chen et al., 2023) creates intermediate data structures using extracted values from the image in order to provide an answer that requires arithmetic computations.
  • Figure :