Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs
Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang, Yijin Guo, Farong Wen, Wenzhe Li, Xuezhi Zhao, Qi Jia, Guangtao Zhai
TL;DR
Q-Mirror tackles the bottleneck of scarce multi-modal scientific benchmarks by proposing a systematic TQA-to-MMQA transformation framework, a dual-benchmark setup for MMQA generation and evaluation, and an autonomous agent that iteratively refines MMQAs. The framework hinges on a multi-dimensional quality rubric with Information Consistency, Cross-Modal Integration, and Standalone Quality, and defines a formal AVG scoring mechanism. Empirical results show current LMMs can generate MMQAs but exhibit gaps in factual grounding, while top judges align well with human judgments; the Q-Mirror agent significantly improves MMQA quality (AVG 78.90 to 85.22 and pass rate 72% to 95%). This work offers a scalable, cost-effective path to large-scale multimodal scientific benchmarks, enabling more robust evaluation and training of advanced reasoning models in science.
Abstract
High-quality multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). Our work has three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: We construct two extensive benchmarks to rigorously evaluate state-of-the-art generation and understanding models on the distinct tasks of MMQA generation and MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.
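The closed loop described above can be illustrated with a minimal Python sketch. It is not the Q-Mirror implementation: the `MMQA` fields, the `generate` and `judge` callables, the 0-100 score scale, the pass threshold of 80, and the round limit are all hypothetical, and the AVG score is assumed here to be an unweighted mean over the three rubric dimensions, which may differ from the paper's formal definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Optional, Tuple

# Rubric dimensions named in the TL;DR; the 0-100 scale is an assumption.
DIMENSIONS = ("information_consistency", "cross_modal_integration", "standalone_quality")

@dataclass
class MMQA:
    question: str
    answer: str
    image_spec: str  # hypothetical placeholder for the generated visual content

TQA = Tuple[str, str]          # (question, answer), text only
Scores = Dict[str, float]

def avg_score(scores: Scores) -> float:
    """Unweighted mean over rubric dimensions (assumed form of AVG)."""
    return mean(scores[d] for d in DIMENSIONS)

def refine_loop(
    tqa: TQA,
    generate: Callable[[TQA, Optional[str]], MMQA],
    judge: Callable[[TQA, MMQA], Tuple[Scores, str]],
    threshold: float = 80.0,   # hypothetical pass threshold
    max_rounds: int = 3,       # hypothetical refinement budget
) -> Tuple[Optional[MMQA], float]:
    """Generate, evaluate, and regenerate with feedback until the draft passes."""
    feedback: Optional[str] = None
    best: Optional[MMQA] = None
    best_avg = float("-inf")
    for _ in range(max_rounds):
        candidate = generate(tqa, feedback)
        scores, feedback = judge(tqa, candidate)
        current = avg_score(scores)
        if current > best_avg:          # keep the best draft seen so far
            best, best_avg = candidate, current
        if current >= threshold:        # stop once the draft passes
            break
    return best, best_avg

# Deterministic stand-ins for a generation model and a judge model.
def stub_generate(tqa: TQA, feedback: Optional[str]) -> MMQA:
    revision = 0 if feedback is None else int(feedback.split()[-1]) + 1
    return MMQA(question=tqa[0], answer=tqa[1], image_spec=f"draft {revision}")

def stub_judge(tqa: TQA, mmqa: MMQA) -> Tuple[Scores, str]:
    revision = int(mmqa.image_spec.split()[-1])
    scores = {d: 70.0 + 10.0 * revision for d in DIMENSIONS}
    return scores, f"improve grounding; revision {revision}"
```

With the stubs above, `refine_loop(("Q?", "A"), stub_generate, stub_judge)` rejects the first draft and stops on the second. In a real pipeline the stubs would be replaced by calls to a generation model and an understanding model acting as judge.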
