Exploring Fine-tuned Generative Models for Keyphrase Selection: A Case Study for Russian
Anna Glazkova, Dmitry Morozov
TL;DR
The paper addresses automatic keyphrase generation for Russian scientific texts by fine-tuning four transformer-based generative models (ruT5, ruGPT, mT5, mBART) and evaluating them in in-domain and cross-domain settings. It draws on the Math&CS dataset and on Cyberleninka texts from the historical, medical, and linguistic domains, and compares the generative approaches with three keyphrase extraction baselines for Russian. In-domain, mBART yields the strongest gains over the baselines (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score). Cross-domain results are considerably lower but still surpass the baselines in several cases, which points to both the potential and the difficulty of cross-domain keyphrase generation. The study highlights the practical value of generative models for keyphrase selection in non-English corpora and suggests future work on cross-domain adaptation, user-controlled generation parameters, and instruction-based models to further improve performance and applicability.
Abstract
Keyphrase selection plays a pivotal role in scholarly texts, facilitating efficient information retrieval, summarization, and indexing. In this work, we explore how fine-tuned generative transformer-based models can be applied to keyphrase selection in Russian scientific texts. We experimented with four generative models, namely ruT5, ruGPT, mT5, and mBART, and evaluated their performance in both in-domain and cross-domain settings. The experiments were conducted on Russian scientific abstracts from four domains: mathematics & computer science, history, medicine, and linguistics. The use of generative models, in particular mBART, led to gains in in-domain performance (up to 4.9% in BERTScore, 9.0% in ROUGE-1, and 12.2% in F1-score) over three keyphrase extraction baselines for the Russian language. Although the results for cross-domain usage were significantly lower, the models still surpassed the baselines in several cases, underscoring the potential for further exploration and refinement in this research field.
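To make the F1-score mentioned above more concrete, the following is a minimal sketch of how a keyphrase-level F1 could be computed for one document. It is an illustration only: the abstract does not specify the matching scheme, so exact matching after lowercasing and whitespace normalization is an assumption, as are the comma separator and the helper names `parse_generated`, `normalize`, and `keyphrase_f1`. The example phrases are invented for demonstration.

```python
from typing import Iterable, List


def parse_generated(text: str, sep: str = ",") -> List[str]:
    """Split a generated string into keyphrases (the separator is an assumption)."""
    return [part.strip() for part in text.split(sep) if part.strip()]


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace. No stemming or lemmatization is applied."""
    return " ".join(phrase.lower().split())


def keyphrase_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Exact-match F1 between two keyphrase sets for a single document."""
    pred_set = {normalize(p) for p in predicted if p.strip()}
    gold_set = {normalize(g) for g in gold if g.strip()}
    if not pred_set or not gold_set:
        return 0.0
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model output and reference keyphrases (not taken from the paper's data).
generated = "нейронные сети, Машинное обучение, трансформеры"
gold = ["машинное обучение", "нейронные сети", "ключевые слова", "русский язык"]

score = keyphrase_f1(parse_generated(generated), gold)
print(round(score, 4))
```

Here two of the three predicted phrases match two of the four reference phrases, so precision is 2/3 and recall is 1/2. A corpus-level score would typically average such per-document values, and a stricter or looser matching rule (for example, lemmatizing Russian word forms first) would change the numbers.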
