Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs

Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang, Yijin Guo, Farong Wen, Wenzhe Li, Xuezhi Zhao, Qi Jia, Guangtao Zhai

TL;DR

Q-Mirror tackles the bottleneck of scarce multi-modal scientific benchmarks by proposing a systematic TQA-to-MMQA transformation framework, a dual-benchmark setup for MMQA generation and evaluation, and an autonomous agent that iteratively refines MMQAs. The framework hinges on a multi-dimensional quality rubric with Information Consistency, Cross-Modal Integration, and Standalone Quality, and defines a formal AVG scoring mechanism. Empirical results show current LMMs can generate MMQAs but exhibit gaps in factual grounding, while top judges align well with human judgments; the Q-Mirror agent significantly improves MMQA quality (AVG 78.90 to 85.22 and pass rate 72% to 95%). This work offers a scalable, cost-effective path to large-scale multimodal scientific benchmarks, enabling more robust evaluation and training of advanced reasoning models in science.

Abstract

High-quality multi-modal benchmarks are crucial for advancing scientific reasoning in large models, yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). Our work comprises three parts: 1) Task Definition & Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: We construct two extensive benchmarks to rigorously evaluate state-of-the-art generation and understanding models on the distinct tasks of MMQA generation and MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72% to 95%, offering a practical path to large-scale scientific benchmarks.
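The closed loop described in the abstract (generate an MMQA, score it against the rubric, feed the critique back into generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `MMQA`, `refine_loop`, `toy_generate`, and `toy_judge` are hypothetical, the unweighted mean over the three rubric dimensions is an assumption standing in for the paper's AVG score (whose exact formula is not given here), and the pass threshold of 80 and the round limit are placeholder values.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Optional, Tuple

# Rubric dimensions named in the summary; the equal weighting below is an assumption.
DIMENSIONS = ("information_consistency", "cross_modal_integration", "standalone_quality")

Scores = Dict[str, float]
TQA = Tuple[str, str]  # (question, answer)


@dataclass
class MMQA:
    question: str
    answer: str
    image_description: str  # stand-in for a rendered image


def avg_score(scores: Scores) -> float:
    """Unweighted mean over the rubric dimensions (assumed form of AVG)."""
    return mean(scores[d] for d in DIMENSIONS)


def refine_loop(
    tqa: TQA,
    generate: Callable[[TQA, str], MMQA],
    judge: Callable[[TQA, MMQA], Tuple[Scores, str]],
    pass_threshold: float = 80.0,  # hypothetical threshold
    max_rounds: int = 3,           # hypothetical round limit
) -> Tuple[Optional[MMQA], float]:
    """Generate, evaluate, and regenerate with feedback until a candidate passes."""
    feedback = ""
    best: Optional[MMQA] = None
    best_avg = float("-inf")
    for _ in range(max_rounds):
        candidate = generate(tqa, feedback)
        scores, feedback = judge(tqa, candidate)
        current = avg_score(scores)
        if current > best_avg:
            best, best_avg = candidate, current
        if current >= pass_threshold:
            break
    return best, best_avg


# Toy stand-ins for the generator and judge models, for illustration only.
def toy_generate(tqa: TQA, feedback: str) -> MMQA:
    question, answer = tqa
    desc = "diagram with labeled values" if feedback else "plain diagram"
    return MMQA(question=question, answer=answer, image_description=desc)


def toy_judge(tqa: TQA, mmqa: MMQA) -> Tuple[Scores, str]:
    labeled = "labeled" in mmqa.image_description
    value = 90.0 if labeled else 60.0
    scores = {d: value for d in DIMENSIONS}
    feedback = "" if labeled else "label the values in the image"
    return scores, feedback
```

With the toy stand-ins, `refine_loop(("Q", "A"), toy_generate, toy_judge)` fails the first round, passes the judge's critique back to the generator, and stops once the second candidate clears the threshold. In a real pipeline the two callables would wrap a generation model and a judge model (or a judge ensemble).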

Paper Structure

This paper contains 43 sections, 4 equations, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of the motivation and key contributions, illustrating: 1) the need for multi-modal data in scientific benchmarks to advance model development, 2) the imbalance between text-only and multi-modal resources, 3) the latent potential of TQAs for multi-modal transformation, and 4) our contributions, including the transformation framework and quality principles, the MMQA generation and evaluation benchmarks, and the Q-Mirror Agent for improving MMQA quality.
  • Figure 2: An illustration of the TQA-to-MMQA transformation, expert annotation, and LMM judge evaluation, including: 1) the full conversion of a TQA into an MMQA, 2) the expert annotation sample for the corresponding MMQA generation case, based on the proposed quality rubric, and 3) the evaluation results of LMMs with their correctness indicated against expert annotations.
  • Figure 3: Performance comparison for the top-3 judge models and Q-Mirror Evaluator (judge ensemble group). In each chart, the red line shows the ratio between the overall average of all ten judges and the average of the top three ($\text{Avg}_{10} / \text{Avg}_{\text{Top-3}}$). The blue line indicates the relative performance of the current judge (or ensemble) compared with the top-three average ($\text{Score}_{\text{current}} / \text{Avg}_{\text{Top-3}}$).
  • Figure 4: Dimensional SRCC correlations of MMQA evaluation, following the setting of the Redundancy Principle. Higher and darker bars indicate stronger agreement between the corresponding dimensions.
  • Figure 5: Custom-built web interface for human evaluation. Annotators view the original TQA and the transformed MMQA (text and image), then score the MMQA according to fine-grained metrics. The interface enforces randomized assignment and automatically exports results in JSON format to ensure independence and reproducibility.
  • ...and 6 more figures