Vision-Language Model Dialog Games for Self-Improvement
Ksenia Konyushkova, Christos Kaplanis, Serkan Cabi, Misha Denil
TL;DR
The paper introduces VLM Dialog Games, a scalable self-improvement framework where two vision-language models, a Describer and a Guesser, engage in goal-oriented dialogs over unlabelled images. Successful dialogs are automatically filtered to form a high-quality synthetic dataset, which is used to fine-tune the base VLM and iteratively improve performance. Experiments on general VQA benchmarks and robotics success detection demonstrate that fine-tuning on dialog-generated data improves both in-game dialog success and downstream visual understanding, with improvements generalizing across datasets. The approach emphasizes data efficiency, domain adaptation, and minimal supervision, offering a promising recipe for self-improving VLMs in data-scarce settings.
Abstract
The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented game centered on image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalizes across datasets. Moreover, because improvements in the model lead to better game play, this procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios, especially when high-quality multimodal data is scarce.
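The play-then-filter loop summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the names `Dialog`, `play_game`, and `collect_successful` are hypothetical, images are stood in for by attribute strings such as "red_ball", and the three toy functions replace the VLM calls that a real Describer and Guesser would make. The turn structure (the Guesser asks, the Describer answers, the Guesser then names an image) is likewise one plausible reading of the image identification game.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (question, answer)


@dataclass
class Dialog:
    images: List[str]            # candidate images shown to the Guesser
    target: int                  # index of the image the Describer holds
    turns: List[Turn] = field(default_factory=list)
    guess: int = -1              # Guesser's final choice

    @property
    def success(self) -> bool:
        # A game succeeds when the Guesser identifies the target image.
        return self.guess == self.target


def play_game(images: Sequence[str], target: int,
              ask: Callable, answer: Callable, guess: Callable,
              max_turns: int = 2) -> Dialog:
    """Run one game: a few question/answer turns, then a final guess."""
    dialog = Dialog(images=list(images), target=target)
    for _ in range(max_turns):
        question = ask(dialog.images, dialog.turns)        # Guesser
        reply = answer(dialog.images[target], question)    # Describer
        dialog.turns.append((question, reply))
    dialog.guess = guess(dialog.images, dialog.turns)
    return dialog


def collect_successful(pool: Sequence[str], n_games: int, n_candidates: int,
                       ask: Callable, answer: Callable, guess: Callable,
                       max_turns: int = 2, seed: int = 0) -> List[Dialog]:
    """Play many games on unlabelled images and keep only the successes."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_games):
        images = rng.sample(list(pool), n_candidates)
        target = rng.randrange(n_candidates)
        dialog = play_game(images, target, ask, answer, guess, max_turns)
        if dialog.success:  # the only filter: did the game succeed?
            kept.append(dialog)
    return kept


# Toy stand-ins for the two agents (hypothetical; a real system calls a VLM).
def toy_ask(images, turns):
    asked = {q for q, _ in turns}
    words = sorted({w for img in images for w in img.split("_")})
    return next((w for w in words if w not in asked), words[0])


def toy_answer(target_image, question):
    return "yes" if question in target_image.split("_") else "no"


def toy_guess(images, turns):
    for i, img in enumerate(images):
        if all((q in img.split("_")) == (a == "yes") for q, a in turns):
            return i
    return 0


if __name__ == "__main__":
    pool = ["red_cube", "blue_cube", "red_ball", "blue_ball"]
    data = collect_successful(pool, n_games=20, n_candidates=2,
                              ask=toy_ask, answer=toy_answer,
                              guess=toy_guess, max_turns=1)
    print(f"kept {len(data)} of 20 dialogs for fine-tuning")
```

In this sketch no image labels are consulted: the filter relies only on whether the game ended with the correct identification. The kept dialogs would then serve as fine-tuning data, and the loop could be rerun with the updated model, mirroring the iterative procedure described in the abstract.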
