GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge
Daniil Gurgurov, Rishu Kumar, Simon Ostermann
TL;DR
GrEmLIn tackles the scarcity of high-quality static embeddings for 87 low- and mid-resource languages by enriching GloVe with multilingual graph knowledge from ConceptNet. The method merges GloVe with graph-derived PPMI signals via a truncated SVD and a subsequent linear projection, producing 300-dimensional embeddings for the full vocabulary. Empirically, graph-enhanced GloVe outperforms vanilla GloVe, FastText, and, in many cases, contextual models on lexical similarity and on several downstream tasks (sentiment analysis, NLI), while remaining a lightweight baseline that is parameter-free at inference time. The work shows that static, graph-informed embeddings remain a scalable, environmentally friendly option for multilingual NLP in data-scarce settings, and it makes the GrEmLIn resources publicly available on HuggingFace. The results underscore the value of integrating structured multilingual knowledge into static representations for low-resource languages and leave room for better fusion and coverage in future work.
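The merging pipeline summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`ppmi`, `truncated_svd`, `merge_embeddings`), the choice to concatenate GloVe and graph vectors before the second SVD, the least-squares fit used for the linear projection, and the toy sizes are all assumptions made for illustration. The summary states only that PPMI signals, a truncated SVD, and a linear projection are involved and that the final vectors have 300 dimensions.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information of a nonnegative count matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf or nan; treat as no association
    return np.maximum(pmi, 0.0)

def truncated_svd(matrix, k):
    """Keep the top-k left singular directions, scaled by their singular values."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def merge_embeddings(glove, graph_counts, shared_idx, dim):
    """Hypothetical merge: glove is (V, d); graph_counts is (n, n) over the
    n words that also appear in the graph; shared_idx gives their GloVe rows."""
    graph_vecs = truncated_svd(ppmi(graph_counts), dim)
    stacked = np.hstack([glove[shared_idx], graph_vecs])
    merged_shared = truncated_svd(stacked, dim)
    # Assumed projection step: least-squares map from GloVe space to merged space,
    # applied to every word so that out-of-graph words also get a vector.
    proj, *_ = np.linalg.lstsq(glove[shared_idx], merged_shared, rcond=None)
    return glove @ proj

# Toy data: 10 words with 4-dimensional GloVe vectors, 6 of them in the graph.
rng = np.random.default_rng(0)
glove = rng.normal(size=(10, 4))
counts = rng.integers(0, 5, size=(6, 6)).astype(float)
counts = counts + counts.T  # symmetric edge counts
merged = merge_embeddings(glove, counts, np.arange(6), dim=4)
print(merged.shape)  # (10, 4)
```

In this sketch the projection is what extends the graph-informed space to the full vocabulary: words absent from the graph still receive a vector because the map is learned on the shared words and then applied to all GloVe rows. A real run would use `dim=300` and sparse matrices rather than the dense toy arrays shown here.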
Abstract
Contextualized embeddings based on large language models (LLMs) are available for various languages, but their coverage is often limited for lower-resourced languages. Using LLMs for such languages is often difficult because of their high computational cost, not only during training but also during inference. Static word embeddings are much more resource-efficient ("green") and thus still provide value, particularly for very low-resource languages. There is, however, a notable lack of comprehensive repositories with such embeddings for diverse languages. To address this gap, we present GrEmLIn, a centralized repository of green, static baseline embeddings for 87 mid- and low-resource languages. We compute GrEmLIn embeddings with a novel method that enhances GloVe embeddings by integrating multilingual graph knowledge, which makes our static embeddings competitive with LLM representations while being parameter-free at inference time. Our experiments demonstrate that GrEmLIn embeddings outperform state-of-the-art contextualized embeddings from E5 on the task of lexical similarity. Given sufficient vocabulary overlap with the target task, they remain competitive in extrinsic evaluation tasks such as sentiment analysis and natural language inference, with average performance gaps of just 5-10% or less compared to state-of-the-art models, and underperform only on topic classification. Our code and embeddings are publicly available at https://huggingface.co/DFKI.
