MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Lun Zhan, Feng Xiong, Huanyong Liu, Feng Zhang, Yuhui Yin

TL;DR

MMKG-RDS is a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs and supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring.

Abstract

Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, which covers five domains, 17 task types, and 14,950 samples. Experimental results show that fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data that challenges existing models on tasks involving tables and formulas, which makes it useful for constructing complex benchmarks. The dataset and code are available at https://github.com/360AILAB-NLP/MMKG-RDS.

Paper Structure (34 sections, 9 figures, 10 tables, 1 algorithm)


Figures (9)

  • Figure 1: High-level architecture of MMKG-RDS. The system consists of five layers: Preprocessing, KG Builder, Storage, Synthesis, and Analysis. MMKG-RDS can dynamically adjust its components and task configurations based on user input, enabling the generation of high-quality, task-specific data for downstream LLM applications.
  • Figure 2: MMKG-RDS comprises: (a) Document parsing: extracting and structuring multimodal documents; (b) Knowledge graph building: a five-stage construction process; (c) Data generation: synthesizing data via customizable sampling; (d) Quality filtering: deduplication, quality control, and scoring for data distribution tuning.
  • Figure 3: Statistical Information of MMKG-RDS-Bench: (a) Distribution of MMKG-RDS-Bench Across Different Domains; (b) Distribution of MMKG-RDS-Bench Across Different Tasks; (c) Distribution of Question Lengths in MMKG-RDS-Bench.
  • Figure 4: (left) Accuracy (%) on different tasks across the Qwen3-series models. (right) Accuracy (%) on different tasks across the Qwen3-VL-series models.
  • Figure 5: Neo4j-based Knowledge Graph in the Chemistry Domain
  • ...and 4 more figures