Table of Contents
Fetching ...

Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen

TL;DR

This work introduces Compound-QA, a benchmark for evaluating large language models on compound questions—queries containing multiple interrelated sub-questions. It builds CQ-Syn to synthesize 1,500 compound QA samples across five question types and three cognitive dimensions (understanding, reasoning, knowledge), derived from existing QA datasets and verified via human review. Nine open-source LLMs are evaluated using a three-dimension framework (Comprehensiveness, Correctness, Diversity) with automatic matching against reference answers and robust position-bias controls, revealing substantial performance gaps on compound questions relative to non-compound tasks. The study further explores improvement strategies, finding that LoRA fine-tuning and, to a lesser extent, chain-of-thought/decomposition approaches substantially boost performance while preserving generalization on other benchmarks. Overall, Compound-QA provides a focused framework to diagnose and enhance multi-question understanding in LLMs, with potential extensions to multimodal settings in future work.

Abstract

Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, most benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. We introduce Compound Question Synthesis (CQ-Syn) to build Compound-QA, a benchmark targeting questions composed of multiple interrelated sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs, and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates the LLM capability in terms of three dimensions, including understanding, reasoning, and knowledge. Evaluating nine open-source LLMs on Compound-QA reveals that their performance on compound questions is notably lower than on non-compound questions. We further explore strategies to enhance LLMs' handling of compound questions, and our results show that these methods substantially improve models' comprehension and reasoning abilities.

Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

TL;DR

This work introduces Compound-QA, a benchmark for evaluating large language models on compound questions—queries containing multiple interrelated sub-questions. It builds CQ-Syn to synthesize 1,500 compound QA samples across five question types and three cognitive dimensions (understanding, reasoning, knowledge), derived from existing QA datasets and verified via human review. Nine open-source LLMs are evaluated using a three-dimension framework (Comprehensiveness, Correctness, Diversity) with automatic matching against reference answers and robust position-bias controls, revealing substantial performance gaps on compound questions relative to non-compound tasks. The study further explores improvement strategies, finding that LoRA fine-tuning and, to a lesser extent, chain-of-thought/decomposition approaches substantially boost performance while preserving generalization on other benchmarks. Overall, Compound-QA provides a focused framework to diagnose and enhance multi-question understanding in LLMs, with potential extensions to multimodal settings in future work.

Abstract

Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, most benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. We introduce Compound Question Synthesis (CQ-Syn) to build Compound-QA, a benchmark targeting questions composed of multiple interrelated sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs, and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates the LLM capability in terms of three dimensions, including understanding, reasoning, and knowledge. Evaluating nine open-source LLMs on Compound-QA reveals that their performance on compound questions is notably lower than on non-compound questions. We further explore strategies to enhance LLMs' handling of compound questions, and our results show that these methods substantially improve models' comprehension and reasoning abilities.

Paper Structure

This paper contains 14 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Examples of non-compound (left) and compound (right) questions: the former poses multiple questions across turns, while the latter combines them within a single turn.
  • Figure 2: The overview of CQ-Syn Data Synthesis.
  • Figure 3: Performance comparison of LLaMA and InternLM when answering compound and non-compound questions.
  • Figure 4: Comparative Performance of Different Improvement Methods on Compound-QA.