ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation

Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, Hongji Zeng

TL;DR

ChiMDQA tackles the scarcity of diverse Chinese QA benchmarks for long-form, multi-domain documents by constructing a six-domain dataset of 6,068 QA pairs and a fine-grained, two-level question taxonomy. It details a rigorous four-stage dataset construction and a comprehensive evaluation framework that combines non-RAG and retrieval-augmented generation metrics, including a RAGChecker-based suite for retrieval and generation quality. Experimental results show GPT-4o achieving leading performance across factual and open-ended questions, and that RAG can improve factual accuracy and reduce generation uncertainty while exposing hallucination challenges. The work provides a practical benchmark and methodological blueprint for advancing Chinese document QA in real-world business contexts.

Abstract

With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To meet this demand, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), designed for downstream business scenarios across six prevalent domains: academic, education, finance, law, medical treatment, and news. ChiMDQA comprises long-form documents from these six domains and 6,068 rigorously curated question-answer (QA) pairs, which are further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset is built for both diversity and quality, making it applicable to NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. This paper also gives a comprehensive overview of the dataset's design objectives, construction methodology, and fine-grained evaluation system, providing a foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Examples of Question Types and Their Topic Coverage in the ChiMDQA Dataset.
  • Figure 2: The Framework of ChiMDQA Dataset Construction Process. (S: Start; M: Middle; E: End; Ori Ans: Original Answer)
  • Figure 3: Performance Comparison of Models with and without RAG.
  • Figure 4: Results of different models for six topics. (Top: Evaluation metrics for factual questions; Bottom: Evaluation metrics for open-ended questions)