ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation
Jing Gao, Shutiao Luo, Yumeng Liu, Yuanming Li, Hongji Zeng
TL;DR
ChiMDQA tackles the scarcity of diverse Chinese QA benchmarks for long-form, multi-domain documents by constructing a six-domain dataset of 6,068 QA pairs and a fine-grained, two-level question taxonomy. It details a rigorous four-stage dataset construction process and a comprehensive evaluation framework that combines non-RAG and retrieval-augmented generation (RAG) metrics, including a RAGChecker-based suite for retrieval and generation quality. Experimental results show that GPT-4o achieves leading performance across factual and open-ended questions, and that RAG can improve factual accuracy and reduce generation uncertainty while exposing hallucination challenges. The work provides a practical benchmark and methodological blueprint for advancing Chinese document QA in real-world business contexts.
Abstract
With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To meet this demand, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), designed for downstream business scenarios across six prevalent domains: academia, education, finance, law, medical treatment, and news. ChiMDQA comprises long-form documents from these six domains and 6,068 rigorously curated, high-quality question-answer (QA) pairs, which are further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset ensures both diversity and high quality, making it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. Additionally, this paper offers a comprehensive overview of the dataset's design objectives, construction methodology, and fine-grained evaluation system, providing a solid foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.
