Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective

Wangjie You, Xusheng Wang, Xing Wang, Wenxiang Jiao, Chao Feng, Juntao Li, Min Zhang

TL;DR

The paper introduces CCMOR, a Chinese commonsense multi-hop reasoning benchmark built by expanding a domain-balanced seed QA corpus with an LLM-driven, verifiable expansion pipeline and expert human validation. It combines seed sampling, iterative sub-question generation, LLM-based verification, and multi-hop composition to produce cross-domain reasoning paths with intermediate steps. Experimental results show that state-of-the-art LLMs struggle with long-tail knowledge and complex reasoning, though retrieval-augmented generation yields significant gains and deliberate reasoning improves final answers. The work provides a culturally grounded, verifiable resource for evaluating and advancing Chinese multi-hop reasoning in large language models, with robust evaluation and analysis of prompting strategies and reasoning styles.

Abstract

While Large Language Models (LLMs) have demonstrated advanced reasoning capabilities, their comprehensive evaluation in general Chinese-language contexts remains understudied. To bridge this gap, we propose Chinese Commonsense Multi-hop Reasoning (CCMOR), a novel benchmark designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning. Specifically, we first construct a domain-balanced seed set from existing QA datasets, then develop an LLM-powered pipeline to generate multi-hop questions anchored on factual unit chains. To ensure the quality of the resulting dataset, we implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions. Using CCMOR, we evaluate state-of-the-art LLMs and find persistent limitations in their ability to process long-tail knowledge and execute knowledge-intensive reasoning. Notably, retrieval-augmented generation substantially mitigates these knowledge gaps, yielding significant performance gains.
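The abstract mentions a domain-balanced seed set drawn from existing QA datasets. As a minimal sketch of what such sampling could look like (not the authors' implementation), the snippet below draws the same number of questions from each domain. The function name `sample_balanced_seeds`, the `domain` field, and the per-domain quota are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_balanced_seeds(questions, per_domain, seed=0):
    """Draw up to `per_domain` questions from each domain (hypothetical helper).

    `questions` is a list of dicts that each carry a "domain" key; this record
    layout is an assumption, not the format used by CCMOR.
    """
    rng = random.Random(seed)

    # Group the candidate questions by their domain label.
    by_domain = defaultdict(list)
    for question in questions:
        by_domain[question["domain"]].append(question)

    # Sample the same quota from every domain; small domains contribute
    # everything they have instead of raising an error.
    seeds = []
    for domain in sorted(by_domain):
        pool = by_domain[domain]
        seeds.extend(rng.sample(pool, min(per_domain, len(pool))))
    return seeds
```

Iterating over the domains in sorted order and using a seeded `random.Random` keeps the draw reproducible, which matters when the sampled seeds are later expanded and verified.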

Paper Structure

This paper contains 29 sections, 1 equation, 8 figures, 10 tables.

Figures (8)

  • Figure 1: An overview of the data construction process. Examples are provided in English for readability.
  • Figure 2: Domain-wise LLM-as-Judge accuracy for different models. CC, HU, ET, LA, SO and NS represent “Chinese Culture”, “Humanities”, “Engineering and Technology”, “Life and Art”, “Society”, and “Natural Science”, respectively.
  • Figure 3: Performance of models with different reasoning styles in the sub-question answering (SQA) and overall answering (OA) settings. Blue represents system-1-style models, while orange represents system-2-style models.
  • Figure 4: LLM-as-Judge accuracy of different baseline models with RAG.
  • Figure 5: The prompt for reclassifying seed factual questions into six domains.
  • ...and 3 more figures