Vision-Language Model Dialog Games for Self-Improvement

Ksenia Konyushkova, Christos Kaplanis, Serkan Cabi, Misha Denil

TL;DR

The paper introduces VLM Dialog Games, a scalable self-improvement framework where two vision-language models, a Describer and a Guesser, engage in goal-oriented dialogs over unlabelled images. Successful dialogs are automatically filtered to form a high-quality synthetic dataset, which is used to fine-tune the base VLM and iteratively improve performance. Experiments on general VQA benchmarks and robotics success detection demonstrate that fine-tuning on dialog-generated data improves both in-game dialog success and downstream visual understanding, with improvements generalizing across datasets. The approach emphasizes data efficiency, domain adaptation, and minimal supervision, offering a promising recipe for self-improving VLMs in data-scarce settings.

Abstract

The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in goal-oriented play centered around image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalises across datasets. Moreover, as the improvements in the model lead to better game play, this procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios, especially when high-quality multimodal data is scarce.
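
The play-then-filter loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's Describer and Guesser are VLMs operating on real images, whereas here both are rule-based stand-ins and each "image" is a hypothetical set of attribute strings. The names `Dialog`, `describer_answer`, `play_game`, and `collect_successful_dialogs` are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumption: an "image" is a frozenset of attribute strings, standing in for pixels.
Image = frozenset


@dataclass
class Dialog:
    """Record of one game: the candidate images, the hidden target, and the turns."""
    images: List[Image]
    target_index: int
    turns: List[Tuple[str, str]] = field(default_factory=list)
    guess: int = -1

    @property
    def success(self) -> bool:
        return self.guess == self.target_index


def describer_answer(target: Image, question: str) -> str:
    """Toy Describer: sees only the target and answers yes/no about one attribute."""
    return "yes" if question in target else "no"


def play_game(images: List[Image], target_index: int, max_turns: int = 5) -> Dialog:
    """Toy Guesser: asks about attributes that split the remaining candidates."""
    dialog = Dialog(images=list(images), target_index=target_index)
    candidates = list(range(len(images)))
    for _ in range(max_turns):
        if len(candidates) == 1:
            break
        attrs = sorted(set().union(*(images[i] for i in candidates)))
        # Choose an attribute that some, but not all, candidates have.
        question = next(
            (a for a in attrs
             if 0 < sum(a in images[i] for i in candidates) < len(candidates)),
            None,
        )
        if question is None:
            break  # remaining candidates cannot be told apart
        answer = describer_answer(images[target_index], question)
        dialog.turns.append((question, answer))
        # Keep only candidates consistent with the Describer's answer.
        candidates = [
            i for i in candidates if (question in images[i]) == (answer == "yes")
        ]
    dialog.guess = candidates[0]
    return dialog


def collect_successful_dialogs(games) -> List[Dialog]:
    """Play every (images, target_index) game and keep only the successful dialogs."""
    played = (play_game(images, target) for images, target in games)
    return [d for d in played if d.success]
```

In this toy version the success filter is the only supervision signal: a dialog is kept exactly when the Guesser's final guess matches the target. In the paper's setting, the kept dialogs would then serve as fine-tuning data for the base VLM, and the loop could be repeated with the improved model.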

Paper Structure

This paper contains 44 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example interaction between a Guesser and a Describer in the VLM Dialog Game. The Guesser aims to identify the target image from a set of distractors by asking questions, which the Describer answers. Since the Guesser correctly identifies the target image at the end of the game, this dialog is considered successful and included in the fine-tuning data.
  • Figure 2: An example dialog game using images from the DOCCI dataset, grouped by clusters. The figure shows the Guesser's questions, the Describer's answers, and the Guesser's internal dialog summary. The Guesser correctly identifies the target image (4) at the end of the dialog.
  • Figure 3: An example of a dialog game with OpenImages images grouped by image similarity. The figure shows the Guesser's questions, the Describer's answers, and the Guesser's internal dialog summary. The Guesser correctly identifies the target image (1) at the end of the dialog.
  • Figure 4: An example of a dialog game in the robotics domain. The figure shows the Guesser's questions, the Describer's answers, and the Guesser's internal dialog summary. The Guesser correctly identifies the target image (1) at the end of the dialog.
  • Figure 5: An example of a dialog game with two images.
  • ...and 2 more figures