An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry

Stephen Meisenbacher, Tim Schopf, Weixin Yan, Patrick Holl, Florian Matthes

TL;DR

Class-specific keyword extraction targets keywords that belong to a predefined class, which general unsupervised keyword extractors do not address. The authors propose an iterative, seed-guided pipeline built on KeyBERT that compares candidate keywords against seed-keyword embeddings, ranks them with a two-part cosine-similarity score, and grows the seed set over $\texttt{n\_iterations}$ batches of the corpus. In each iteration, new seeds are drawn from the top-scoring candidates, controlled by the $\texttt{percentile\_newseed}$ percentile threshold and the $\texttt{number\_newseed}$ count. Evaluated on a German Handelsregister corpus mapped to WZ 2008 economic sectors, the approach outperforms RAKE, YAKE, KeyBERT, and Guided KeyBERT, most clearly under exact-match and lemma-match evaluation. The code is available at https://github.com/sjmeis/CSKE, and the method is relevant to targeted information retrieval, topic modeling, and sector classification.

Abstract

The task of $\textit{keyword extraction}$ is often an important initial step in unsupervised information extraction, forming the basis for tasks such as topic modeling or document classification. While recent methods have proven to be quite effective in the extraction of keywords, the identification of $\textit{class-specific}$ keywords, or only those pertaining to a predefined class, remains challenging. In this work, we propose an improved method for class-specific keyword extraction, which builds upon the popular $\textbf{KeyBERT}$ library to identify only keywords related to a class described by $\textit{seed keywords}$. We test this method using a dataset of German business registry entries, where the goal is to classify each business according to an economic sector. Our results reveal that our method greatly improves upon previous approaches, setting a new standard for $\textit{class-specific}$ keyword extraction.

Paper Structure (14 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: Our class-specific keyword extraction pipeline. With a document corpus and class-specific keyword sets as inputs, we iterate sequentially over batches of the corpus, using a modified KeyBERT and a two-part scoring scheme. Top keywords are added to the seed keywords for the next iteration, until a final set of keywords is achieved.
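
The loop described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that), and it rests on several assumptions: `toy_embed` is a hypothetical character-trigram stand-in for the sentence encoder that KeyBERT would supply, candidates are plain whitespace tokens, and the two-part score is assumed here to be the average of the mean and the maximum cosine similarity to the current seeds. The parameter names `number_newseed` and `percentile_newseed` follow the TL;DR; their exact semantics in the original code may differ.

```python
import zlib
import numpy as np

def toy_embed(text, dim=128):
    """Hypothetical stand-in for a sentence encoder: hashed character trigrams, L2-normalised."""
    vec = np.zeros(dim)
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def two_part_score(candidate_vec, seed_matrix):
    """Assumed scoring: average of mean and max cosine similarity to the seeds."""
    sims = seed_matrix @ candidate_vec  # rows are unit vectors, so this is cosine similarity
    return 0.5 * (sims.mean() + sims.max())

def extract_class_keywords(batches, seeds, number_newseed=2, percentile_newseed=50):
    """Iterate over corpus batches, promoting top-scoring candidates to seeds."""
    seeds = [s.lower() for s in seeds]
    for batch in batches:  # one iteration per batch
        # Candidates: unique lowercase tokens in the batch that are not seeds yet.
        candidates = sorted({w for doc in batch for w in doc.lower().split()} - set(seeds))
        if not candidates:
            continue
        seed_matrix = np.stack([toy_embed(s) for s in seeds])
        scores = np.array([two_part_score(toy_embed(c), seed_matrix) for c in candidates])
        # Keep candidates at or above the percentile cutoff, then cap the count.
        cutoff = np.percentile(scores, percentile_newseed)
        ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
        seeds.extend([c for c, s in ranked if s >= cutoff][:number_newseed])
    return seeds

if __name__ == "__main__":
    batches = [
        ["Handel mit Fahrzeugen", "Reparatur von Fahrzeugen"],
        ["Vermietung von Fahrzeugen aller Art"],
    ]
    print(extract_class_keywords(batches, ["Fahrzeug"]))
```

Swapping `toy_embed` for a real multilingual sentence encoder and the whitespace tokenizer for n-gram candidate generation would bring the sketch closer to a KeyBERT-style setup; the seed-expansion loop itself would stay the same.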