Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions
Egemen Sert, Şeyda Ertekin
TL;DR
The paper addresses how data design, specifically curriculum-aligned multimodal supervision, can close the gap between open-weight vision–language models and proprietary systems on complex reasoning tasks. It introduces a 161.4 million-token multimodal corpus built from CoreReason, MetaReason, and ContextVQA, plus YKSUniform as a standardized benchmark, and demonstrates that data-centric SFT with the QMSA syntax can achieve near-state-of-the-art accuracy without reinforcement learning. Key findings show that dataset composition and structured meta-information substantially improve reasoning, while verbose teacher traces can hurt generalization; the EduMix-QMSA model attains 78.6% on YKSUniform, competitive with proprietary models, highlighting the practical value of curated data. Collectively, the work provides open resources and a concrete data-centric framework for advancing open-weight vision–language models in education and other structured, multilingual domains.
Abstract
Multimodal reasoning has become a cornerstone of modern AI research. Standardized exam questions offer a uniquely rigorous testbed for such reasoning, providing structured visual contexts and verifiable answers. While recent progress has largely focused on algorithmic advances such as reinforcement learning (e.g., GRPO, DPO), the data-centric foundations of vision-language reasoning remain less explored. We show that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. To this end, we compile a 161.4 million-token multimodal dataset combining textbook question-solution pairs, curriculum-aligned diagrams, and contextual materials, and fine-tune Qwen-2.5VL-32B using an optimized reasoning syntax (QMSA). On our newly released benchmark YKSUniform, which standardizes 1,854 multimodal exam questions across 309 curriculum topics, the resulting model achieves 78.6% accuracy, only 1.0% below Gemini 2.0 Flash. Our results reveal that data composition and representational syntax play a decisive role in multimodal reasoning. This work establishes a data-centric framework for advancing open-weight vision-language models, demonstrating that carefully curated, curriculum-grounded multimodal data can elevate supervised fine-tuning to near state-of-the-art performance.
