LongKey: Keyphrase Extraction for Long Documents

Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal

TL;DR

This work tackles keyphrase extraction in long-context documents by introducing LongKey, an encoder-based framework that processes extended text with Longformer and a novel keyphrase embedding pooler based on max pooling. The approach uses a three-stage pipeline (Longformer-based word embeddings, CNN-derived n-gram keyphrase embeddings, and joint candidate scoring with a Margin Ranking loss and a Binary Cross-Entropy loss) to produce context-aware keyphrase representations. LongKey extends the positional embeddings to $8{,}192$ positions and supports documents of up to $96{,}000$ tokens by chunking, which lets it handle long-form text, while maintaining efficiency via a max-pooled aggregation of keyphrase occurrences. Empirically, LongKey outperforms unsupervised and language-model-based baselines on the LDKP datasets and on six unseen domains, with particularly strong gains from the keyphrase embedding pooler, and it remains competitive on short-context datasets. The work advances practical long-context keyphrase extraction, with implications for indexing, summarization, and retrieval in large-scale document corpora, and points to future work on longer keyphrases and broader context expansion.
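The chunking step mentioned above can be illustrated with a minimal sketch. It assumes non-overlapping, consecutive chunks and simple truncation at the token budget; the text only states the two limits (8,192 and 96,000), so the function name `chunk_token_ids` and the splitting strategy are illustrative assumptions, not the authors' implementation.

```python
from typing import List

MAX_CHUNK_TOKENS = 8192   # positional-embedding limit stated in the text
MAX_DOC_TOKENS = 96_000   # overall token budget stated in the text


def chunk_token_ids(token_ids: List[int],
                    chunk_size: int = MAX_CHUNK_TOKENS,
                    max_tokens: int = MAX_DOC_TOKENS) -> List[List[int]]:
    """Truncate to max_tokens, then split into consecutive chunks.

    Assumption: chunks do not overlap and the tail chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    truncated = token_ids[:max_tokens]
    return [truncated[i:i + chunk_size]
            for i in range(0, len(truncated), chunk_size)]


# A 100,000-token document is cut to 96,000 tokens and split into chunks.
chunks = chunk_token_ids(list(range(100_000)))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Under these assumptions a document at the 96,000-token budget yields 12 chunks, 11 full ones and a shorter final one.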

Abstract

In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
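As a rough illustration of the max-pooling embedder, the sketch below merges every occurrence of the same candidate keyphrase into a single vector by taking the element-wise maximum. The function name `max_pool_occurrences`, the lowercase normalization, and the toy three-dimensional embeddings are assumptions made for the example; the abstract does not specify these details.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def max_pool_occurrences(
    occurrences: Sequence[Tuple[str, Sequence[float]]],
) -> Dict[str, List[float]]:
    """Combine all occurrence embeddings of a candidate via element-wise max.

    Assumption: occurrences are matched by lowercased surface form.
    """
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for phrase, embedding in occurrences:
        grouped[phrase.lower()].append(embedding)
    # zip(*embeddings) walks the vectors dimension by dimension.
    return {
        phrase: [max(values) for values in zip(*embeddings)]
        for phrase, embeddings in grouped.items()
    }


# Toy data: two occurrences of one candidate, one occurrence of another.
occurrences = [
    ("keyphrase extraction", [0.1, 0.9, -0.3]),
    ("Keyphrase Extraction", [0.4, 0.2, -0.1]),
    ("long documents", [0.0, 0.5, 0.7]),
]
pooled = max_pool_occurrences(occurrences)
print(pooled["keyphrase extraction"])  # [0.4, 0.9, -0.1]
```

Each candidate ends up with one embedding regardless of how often it appears, so the scoring stage sees a fixed-size representation per keyphrase.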

Paper Structure

This paper contains 15 sections, 19 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Overall workflow of the LongKey approach.
  • Figure 2: F1 scores of the LongKey and JointKPE methods with different encoders on the LDKP3K dataset, grouped by document length. F1@K is shown for six ranges of document length (from fewer than 512 words to more than 8192), where $K \in [1,10]$. Dashed lines show the F1@$\mathcal{O}$ for each interval.
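
The F1@K and F1@$\mathcal{O}$ metrics named in the Figure 2 caption are not defined in this excerpt. The sketch below follows a common convention and should be read as an assumption: precision is computed over the top K ranked predictions, recall over all gold keyphrases, and F1@$\mathcal{O}$ sets K to the number of gold keyphrases. The function name `f1_at_k` and the toy data are hypothetical.

```python
from typing import List, Optional, Set


def f1_at_k(ranked: List[str], gold: Set[str], k: Optional[int] = None) -> float:
    """F1 over the top-k predictions; k=None uses len(gold) (assumed F1@O)."""
    if not gold:
        return 0.0
    if k is None:
        k = len(gold)
    top_k = ranked[:k]
    correct = len(set(top_k) & gold)
    if correct == 0:
        return 0.0
    # Assumption: precision divides by the number of predictions actually kept.
    precision = correct / len(top_k)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


ranked = ["a", "b", "c", "d"]   # predictions, best first
gold = {"a", "c", "e"}          # reference keyphrases
print(round(f1_at_k(ranked, gold, k=2), 3))  # 0.4
print(round(f1_at_k(ranked, gold), 3))       # 0.667
```

Some evaluations divide precision by K even when fewer than K predictions exist, so scores from this sketch may differ slightly from published numbers.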