Extracting domain-specific terms using contextual word embeddings
Andraž Repar, Nada Lavrač, Senja Pollak
TL;DR
This work tackles automated terminology extraction for Slovenian, an under-resourced language, by combining traditional linguistic and statistical cues with contextual word embeddings from a Slovenian ELMo model. It replaces term candidate generation based on fixed POS patterns with a shallow, data-driven filter, and trains a linear SVM on a unified feature set of linguistic, statistical, and 1024-dimensional contextual features. Evaluated on the four domains of the RSDO5 corpus, the approach achieves F1 scores of 0.530, 0.569, 0.561, and 0.594, outperforming prior Slovenian methods and pattern-based baselines and demonstrating the value of contextual representations for term extraction. The method is practical for real-world use, works with modestly sized corpora, and can be extended to other languages for which domain corpora and embeddings are available. Future work includes integrating additional features and experimenting with other contextual models such as BERT.
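As a reminder of how an F1 score of this kind can be computed, here is a minimal sketch. It assumes evaluation compares the set of extracted terms against the set of gold-standard terms; the exact evaluation protocol is not specified in this summary, and the function name and toy terms below are illustrative only, not taken from RSDO5.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for extracted terms.

    Assumption: a predicted term counts as correct only if it matches
    a gold term exactly (hypothetical protocol for illustration).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: four extracted candidates, three gold terms, two in common.
extracted = ["word embedding", "term extraction", "the results", "paper"]
gold_terms = ["word embedding", "term extraction", "support-vector machine"]
precision, recall, f1 = precision_recall_f1(extracted, gold_terms)
```

With two of four extracted candidates correct and two of three gold terms found, precision is 1/2, recall is 2/3, and F1 is 4/7.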
Abstract
Automated terminology extraction is the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction that combines features from traditional term extraction systems with new features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse RSDO5, a new term-annotated corpus for the Slovenian language, and devise a set of rules for term candidate selection. We then generate statistical, linguistic and context-based features for each candidate. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains of the RSDO5 corpus (biomechanics, linguistics, chemistry, veterinary) and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in F1 score over the previous state of the art, indicating that contextual word embeddings are valuable for improving term extraction.
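To illustrate what corpus-derived candidate selection rules might look like, here is a minimal sketch. The rules used here (allowed first and last POS tags, plus a maximum length, all learned from annotated terms) are assumptions for illustration; the abstract does not state the actual rules derived from RSDO5, and the function names and toy English data are hypothetical.

```python
def learn_filter(annotated_terms):
    """Derive simple rules from gold terms.

    annotated_terms: list of POS-tag sequences, one per annotated term.
    Returns the tags seen in first position, the tags seen in last
    position, and the maximum term length (hypothetical rule set).
    """
    first_tags = {tags[0] for tags in annotated_terms}
    last_tags = {tags[-1] for tags in annotated_terms}
    max_len = max(len(tags) for tags in annotated_terms)
    return first_tags, last_tags, max_len


def candidates(sentence, rules):
    """List every n-gram that satisfies the learned rules.

    sentence: list of (token, pos) pairs.
    """
    first_tags, last_tags, max_len = rules
    found = []
    for start in range(len(sentence)):
        longest_end = min(start + max_len, len(sentence))
        for end in range(start + 1, longest_end + 1):
            span = sentence[start:end]
            if span[0][1] in first_tags and span[-1][1] in last_tags:
                found.append(" ".join(token for token, _ in span))
    return found


# Toy data: POS sequences of three annotated terms, one tagged sentence.
rules = learn_filter([["ADJ", "NOUN"], ["NOUN"], ["NOUN", "NOUN"]])
sentence = [("the", "DET"), ("contextual", "ADJ"), ("word", "NOUN"),
            ("embeddings", "NOUN"), ("help", "VERB")]
result = candidates(sentence, rules)
```

A shallow filter of this kind deliberately over-generates; in the approach described above, it is the trained classifier that decides which candidates are actual terms.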
