QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization
Shiyue Zhang, David Wan, Arie Cattan, Ayal Klein, Ido Dagan, Mohit Bansal
TL;DR
QAPyramid reframes content selection evaluation for text summarization by decomposing reference summaries into QA-SRL-based question-answer pairs and assessing their presence in system outputs. It couples crowdsourced QA generation with presence judgments, showing high inter-annotator agreement and finer-grained credit than ACU-based approaches. The authors also develop semi-automatic and fully automatic variants (SemiAutoQAPyramid and AutoQAPyramid) that correlate more strongly with gold QAPyramid scores than other widely adopted metrics do. The work offers a scalable, reproducible, and semantically sensitive framework for evaluating summary content selection, with potential applicability to broader language-generation tasks.
Abstract
How to properly conduct human evaluations for text summarization is a longstanding challenge. The Pyramid human evaluation protocol, which assesses content selection by breaking the reference summary into subunits and verifying their presence in the system summary, has been widely adopted. However, it lacks a systematic definition of these subunits and of their granularity. We address these problems by proposing QAPyramid, which decomposes each reference summary into finer-grained question-answer (QA) pairs according to the QA-SRL framework. We collect QA-SRL annotations for reference summaries from CNN/DM and evaluate 10 summarization systems, resulting in 8.9K QA-level annotations. We show that, compared to Pyramid, QAPyramid provides more systematic and fine-grained content selection evaluation while maintaining high inter-annotator agreement without needing expert annotations. Furthermore, we propose metrics that automate the evaluation pipeline and achieve higher correlations with QAPyramid than other widely adopted metrics.
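The presence-checking idea described above can be illustrated with a minimal sketch. It assumes that a summary's score is the fraction of reference QA pairs judged present in the system summary, and that a system's score is the mean over its summaries; the abstract does not state the exact aggregation, so both are assumptions. The names `QAPair`, `summary_score`, and `system_score`, as well as the example sentence and judgments, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class QAPair:
    """One QA-SRL-style unit derived from a reference summary (hypothetical type)."""
    question: str
    answer: str


def summary_score(judgments: list[bool]) -> float:
    """Fraction of reference QA pairs judged present in one system summary.

    Assumption: each QA pair carries equal weight.
    """
    if not judgments:
        raise ValueError("a reference summary needs at least one QA pair")
    return sum(judgments) / len(judgments)


def system_score(per_summary_judgments: list[list[bool]]) -> float:
    """Mean summary-level score across all summaries of one system (assumed aggregation)."""
    return mean(summary_score(j) for j in per_summary_judgments)


# Illustrative reference: "The mayor opened the new bridge on Monday."
reference_qas = [
    QAPair("Who opened something?", "The mayor"),
    QAPair("What did someone open?", "the new bridge"),
    QAPair("When did someone open something?", "on Monday"),
]

# Hypothetical presence judgments for a system summary that omits the date.
judgments = [True, True, False]
score = summary_score(judgments)
```

Under these assumptions, a system summary that conveys who opened the bridge but not when still receives partial credit for two of the three QA pairs, which is the kind of fine-grained credit that coarser subunits would not separate.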
