A Comprehensive Analysis of Static Word Embeddings for Turkish

Karahan Sarıtaş, Cahid Arda Öz, Tunga Güngör

TL;DR

The paper conducts a comprehensive, cross-model analysis of static word embeddings for Turkish by converting contextual embeddings (ELMo, BERT) into static forms using Aggregate and X2Static methods, and by evaluating them alongside traditional static models (Word2Vec, FastText, GloVe). It performs extensive intrinsic (analogy and similarity) and extrinsic (sentiment analysis, PoS tagging, NER) evaluations on Turkish, using two large corpora for training and multiple domain datasets for tasks. The study finds that static embeddings derived via the X2Static BERT approach generally deliver the strongest performance across tasks, with Word2Vec and FastText providing strong results in semantic and morphologically rich contexts, while aggregation of contextual embeddings often underperforms. The work highlights the practicality of static, context-derived embeddings as energy-efficient alternatives to heavy contextual models and contributes a public Turkish embedding repository to support future research and applications.

Abstract

Word embeddings are fixed-length, dense, distributed word representations used in natural language processing (NLP) applications. There are two main types of word embedding models: non-contextual (static) models and contextual models. The former generate a single embedding for a word regardless of its context, while the latter produce distinct embeddings for a word based on the specific contexts in which it appears. Many works compare contextual and non-contextual embedding models within their respective groups in different languages. However, very few studies compare the models in these two groups with each other, and there is no such study for Turkish. Such a comparison requires converting contextual embeddings into static embeddings. In this paper, we compare and evaluate the performance of several contextual and non-contextual models in both intrinsic and extrinsic evaluation settings for Turkish. We make a fine-grained comparison by analyzing the syntactic and semantic capabilities of the models separately. The results of the analyses provide insights into the suitability of different embedding models for different types of NLP tasks. We also build a Turkish word embedding repository comprising the embedding models used in this work, which may serve as a valuable resource for researchers and practitioners in the field of Turkish NLP. We make the word embeddings, scripts, and evaluation datasets publicly available.
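
To make the conversion step concrete, the sketch below shows one simple way a contextual model's per-occurrence vectors could be collapsed into a single static vector per word: averaging over all occurrences. This is only an illustration under assumptions. The function name `aggregate_static`, the mean pooling, and the two-dimensional toy vectors are hypothetical and are not taken from the paper, whose Aggregate and X2Static procedures may differ in detail.

```python
from collections import defaultdict

import numpy as np


def aggregate_static(occurrences):
    """Average contextual vectors per word (hypothetical helper, mean pooling assumed).

    occurrences: iterable of (word, vector) pairs, one pair per token occurrence.
    Returns a dict mapping each word to the mean of its occurrence vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in occurrences:
        vec = np.asarray(vec, dtype=float)
        if word in sums:
            sums[word] = sums[word] + vec
        else:
            sums[word] = vec.copy()
        counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}


# Toy input: "yüz" (Turkish for both "face" and "hundred") appears in two
# contexts with different contextual vectors; "ev" ("house") appears once.
occurrences = [
    ("yüz", [1.0, 0.0]),
    ("yüz", [0.0, 1.0]),
    ("ev", [2.0, 2.0]),
]
static = aggregate_static(occurrences)
```

After aggregation, the two context-specific vectors of "yüz" are merged into one, which is exactly the property that distinguishes a static embedding from a contextual one.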

Paper Structure

This paper contains 13 sections, 10 equations, 1 figure, 10 tables.

Figures (1)

  • Figure 1: Word2Vec embeddings were employed for visualizing word similarities on the left, while FastText embeddings were utilized for word analogies on the right, both employing PCA to project vectors into a two-dimensional embedding space.
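
The two operations named in the Figure 1 caption, PCA projection to two dimensions and word analogy, can be sketched as follows. This is a minimal illustration under assumptions: the three-dimensional vectors are hand-made toy values rather than real Word2Vec or FastText embeddings, and `pca_2d` and `analogy` are hypothetical helper names. The analogy is solved with the common vector-offset formulation and cosine similarity, which may not match the paper's exact evaluation procedure.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates in 2-D


def analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" with the vector-offset method."""
    target = emb[b] - emb[a] + emb[c]
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                    # skip the query words
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Toy vectors (assumed values): erkek = man, kadın = woman,
# kral = king, kraliçe = queen, elma = apple.
emb = {
    "erkek": np.array([1.0, 0.0, 0.0]),
    "kadın": np.array([0.0, 1.0, 0.0]),
    "kral": np.array([1.0, 0.0, 1.0]),
    "kraliçe": np.array([0.0, 1.0, 1.0]),
    "elma": np.array([0.5, 0.5, -1.0]),
}

coords = pca_2d(np.stack(list(emb.values())))    # one 2-D point per word
answer = analogy(emb, "erkek", "kadın", "kral")
```

With these toy vectors, the offset `kadın - erkek + kral` coincides with the vector for "kraliçe", so that word is returned; real embeddings would only approximate this relation.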