SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

David Števaňák; Marek Šuppa

Abstract

Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6\% exact-match $F1@6$, with a large gap to partial matching (up to 51.5\%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact--partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ($\kappa = 0.61$) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (https://huggingface.co/datasets/NaiveNeuron/SlovKE) and evaluation code (https://github.com/NaiveNeuron/SlovKE) are publicly available.
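To make the exact versus partial distinction concrete, the following is a minimal sketch of an $F1@k$ computation under both matching modes. It is not the paper's evaluation code: the function names (`normalize`, `f1_at_k`) are hypothetical, and the partial-match rule used here (two phrases match if they share at least one whitespace-separated token) is an assumption chosen for illustration.

```python
from typing import List


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(phrase.lower().split())


def exact_match(pred: str, gold: str) -> bool:
    return pred == gold


def partial_match(pred: str, gold: str) -> bool:
    # Assumption: a partial match means sharing at least one token.
    return bool(set(pred.split()) & set(gold.split()))


def f1_at_k(predicted: List[str], gold: List[str], k: int = 6,
            partial: bool = False) -> float:
    """F1 over the top-k predicted keyphrases against the gold set."""
    preds = [normalize(p) for p in predicted[:k]]
    golds = [normalize(g) for g in gold]
    match = partial_match if partial else exact_match

    # Precision: share of predictions that match some gold keyphrase.
    matched_preds = sum(any(match(p, g) for g in golds) for p in preds)
    # Recall: share of gold keyphrases matched by some prediction.
    matched_golds = sum(any(match(p, g) for p in preds) for g in golds)

    precision = matched_preds / len(preds) if preds else 0.0
    recall = matched_golds / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: the first prediction is the genitive form of the first
# gold keyphrase, so it fails both exact and token-level matching.
gold = ["Rozvojový potenciál", "región"]
predicted = ["rozvojového potenciálu", "potenciál regiónu", "región"]

exact_score = f1_at_k(predicted, gold, k=6, partial=False)
partial_score = f1_at_k(predicted, gold, k=6, partial=True)
```

On this toy input the exact score is about 0.4 and the partial score about 0.8. The inflected prediction `rozvojového potenciálu` matches nothing in either mode, which is the kind of morphological mismatch the abstract describes; a lemmatizer or stemmer would be needed to credit it.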

Paper Structure (37 sections, 11 figures, 3 tables)

Introduction
Related Work
Methodology
YAKE
TextRank
KeyBERT
KeyLLM
Dataset
Data Collection
Data Cleaning
Dataset Statistics
Evaluation
Evaluation Metrics
Results
Comparing Baseline Models
...and 22 more sections

Figures (11)

Figure 1: Example abstract from the Test22K dataset (Slovak). Color coding indicates keyphrase occurrences: author-assigned keyphrases appear in the text in various inflected forms (red: exact match, teal: partial overlap, green: single-word match, blue: shared fragment). Note that surface forms in the abstract (e.g., rozvojového potenciálu, genitive) differ from the canonical keyphrase form (Rozvojový potenciál, nominative), illustrating the morphological mismatch challenge.
Figure 2: Filtering process and percentage of rows in the original dataset
Figure 3: The $F1@6$ score for exact (stronger color) and partial (lighter color) matches is shown for the baseline models, comparing Zelinka and SlovKE.
Figure 4: F1 score of exact and partial match for baseline models using different $k$ values, where $\mathcal{O}$ denotes $k = | \text{golden set} |$.
Figure 5: F1 score for KeyLLM with embeddings for thresholds 75, 85, 90, and without embeddings for two Sentence Transformers.
...and 6 more figures
