ROUGE-K: Do Your Summaries Have Keywords?

Sotaro Takeshita, Simone Paolo Ponzetto, Kai Eckert

TL;DR

This paper introduces ROUGE-K, a keyword-oriented evaluation metric that measures how well summaries include pre-defined keywords by computing recall over keywords, formalized as $R\text{-}K = \frac{\mathrm{Count}(\text{kws} \cap \text{n-grams})}{\mathrm{Count}(\text{kws})}$. It proposes an automatic keyword extraction heuristic based on multi-reference overlap, and validates ROUGE-K against human judgments, showing higher agreement than traditional ROUGE and BERTScore on relevance. Through experiments on SciTLDR, XSum, and ScisummNet, the authors demonstrate that strong baselines often miss essential keywords, and ROUGE-K provides a more discriminative view of keyword coverage. Finally, they present four lightweight TF-IDF–guided approaches (RwEnc, RwGen, TDMTL, TDSum) to steer models toward including more keywords while preserving overall summarization quality, offering practical improvements for keyword-rich summarization tasks.
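The keyword recall formula above can be illustrated with a minimal sketch. This is not the authors' released implementation. The tokenizer, the function names (`tokenize`, `ngrams`, `rouge_k`), and the rule that a keyword counts as included when its token sequence appears as a contiguous n-gram of the summary are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def rouge_k(summary, keywords):
    """Keyword recall: matched keywords divided by total keywords.

    A keyword is treated as matched when its tokens occur as a contiguous
    n-gram in the summary (an assumption of this sketch).
    """
    # Deduplicate keywords after tokenization and drop ones that tokenize to nothing.
    kw_set = {tuple(tokenize(kw)) for kw in keywords} - {()}
    if not kw_set:
        return 0.0
    tokens = tokenize(summary)
    hits = sum(1 for kw in kw_set if kw in ngrams(tokens, len(kw)))
    return hits / len(kw_set)


if __name__ == "__main__":
    summary = "We propose ROUGE-K, a keyword-oriented metric for summarization."
    keywords = ["ROUGE-K", "keyword-oriented metric", "human evaluation"]
    # Two of the three keywords appear in the summary.
    print(round(rouge_k(summary, keywords), 3))
```

Because the score is a pure recall, it only rewards keyword coverage and does not penalize long summaries, which is why it is meant to complement rather than replace metrics such as ROUGE.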

Abstract

Keywords, that is, content-relevant words in summaries, play an important role in efficient information conveyance, making it critical to assess during evaluation whether system-generated summaries contain such informative words. However, existing evaluation metrics for extreme summarization models do not pay explicit attention to keywords in summaries, leaving developers unaware of their presence. To address this issue, we present a keyword-oriented evaluation metric, dubbed ROUGE-K, which provides a quantitative answer to the question "How well do summaries include keywords?" Through the lens of this keyword-aware metric, we surprisingly find that a current strong baseline model often misses essential information in its summaries. Our analysis reveals that human annotators indeed find summaries with more keywords to be more relevant to the source documents. This is an important yet previously overlooked aspect of evaluating summarization systems. Finally, to enhance keyword inclusion, we propose four approaches for incorporating word importance into a transformer-based model and show experimentally that they guide models to include more keywords while maintaining overall quality. Our code is released at https://github.com/sobamchan/rougek.

Paper Structure (26 sections, 3 equations, 2 figures, 11 tables)