VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models
Gokul Karthik Kumar, Iheb Chaabane, Kebin Wu
TL;DR
VisCon-100K presents a context-grounded approach to fine-tuning vision-language models (VLMs) by constructing a large-scale dataset from interleaved image-text web documents. The pipeline generates image-contextual captions with GPT-4V and converts them into diverse Q&A pairs via OpenChat 3.5, followed by deduplication and a leaky modality mix that tightly couples visual and contextual information. Empirical results across six benchmarks show that contextual data improves performance for two VLMs, with the leaky modality mix further enhancing cross-modal integration. The work also provides a contextual captioner trained on VisCon-100K for scalable data generation and extends the method to VisCon-1M, highlighting practical value for open-source research and real-world multimodal systems.
Abstract
Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved image-text web documents. Our approach transforms 45K web documents from the OBELICS dataset into 100K image conversation samples. We utilize GPT-4V to generate image-contextual captions and the OpenChat 3.5 model to convert these captions into diverse free-form and multiple-choice question-answer pairs. Integrating this dataset for fine-tuning considerably enhances VLM performance across multiple benchmarks. Unlike methods that focus solely on fine-grained visual content, our approach leverages accompanying web context, yielding superior results. We also discover that a 'leaky modality mix', where conversation samples contain questions answerable from both the image and its contextual caption, outperforms non-leaky combinations of captions and Q&A pairs. The VisCon-100K dataset shows strong performance with two popular VLM approaches: a text-only large language model (LLM) aligned with a vision encoder using image caption data (ShareGPT4V-7b) and a multimodally pretrained LLM (IDEFICS2-8b) using interleaved image-text data. In addition to releasing the VisCon-100K dataset, we provide a contextual captioner trained on this dataset, facilitating scalable fine-tuning data generation for future research and open-source applications. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.
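To make the distinction between leaky and non-leaky mixes concrete, the following is a minimal Python sketch. It rests on one reading of the abstract: that a leaky sample places the contextual caption and the Q&A pairs for an image in the same conversation, so the questions can be answered from either the image or the caption, while a non-leaky mix keeps them in separate samples. The `Turn` and `Sample` classes, the function names, and the prompt wording are all hypothetical illustrations, not the released data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class Sample:
    image_id: str
    turns: List[Turn]


def caption_turns(context: str, caption: str) -> List[Turn]:
    # Hypothetical prompt wording; the actual prompt is not given in the abstract.
    prompt = f"Web context: {context}\nDescribe the image using this context."
    return [Turn("user", prompt), Turn("assistant", caption)]


def qa_turns(qa_pairs: List[Tuple[str, str]]) -> List[Turn]:
    # Each (question, answer) pair becomes a user turn and an assistant turn.
    turns: List[Turn] = []
    for question, answer in qa_pairs:
        turns.append(Turn("user", question))
        turns.append(Turn("assistant", answer))
    return turns


def build_leaky(image_id: str, context: str, caption: str,
                qa_pairs: List[Tuple[str, str]]) -> List[Sample]:
    # Assumed "leaky" layout: caption and Q&A share one conversation,
    # so the caption text is visible when the questions are answered.
    return [Sample(image_id, caption_turns(context, caption) + qa_turns(qa_pairs))]


def build_non_leaky(image_id: str, context: str, caption: str,
                    qa_pairs: List[Tuple[str, str]]) -> List[Sample]:
    # Assumed "non-leaky" layout: caption and Q&A are separate samples,
    # so the Q&A conversation never sees the caption text.
    return [
        Sample(image_id, caption_turns(context, caption)),
        Sample(image_id, qa_turns(qa_pairs)),
    ]
```

Under these assumptions, both builders emit the same turns for a given image; the only difference is whether they are grouped into one conversation or split across two.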
