Injecting Wiktionary to improve token-level contextual representations using contrastive learning

Anna Mosolova, Marie Candito, Carlos Ramisch

TL;DR

Contextual token representations vary too much across occurrences of the same word sense, which limits their use for lexical semantics. The paper addresses this by injecting supervision from a crowd-sourced lexicon, Wiktionary, into contrastive learning (CL). It proposes a lexicon-guided, supervised CL objective that leverages Wiktionary example sentences, and it tests how optional dimensionality reduction (PCA whitening) affects the resulting embeddings. Empirically, the method achieves a new state of the art for unsupervised WiC on the original WiC test set (OrigWiC) and shows consistent gains across WiC variants, with additional, though smaller, benefits for semantic frame induction. Although the experiments are on English, the approach is adaptable to other languages with large Wiktionaries, offering a practical path to more sense-aware token representations for downstream tasks.
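To make the two ingredients above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a SupCon-style (InfoNCE) loss in which token embeddings of occurrences that share a Wiktionary sense are positives, and a standard SVD-based PCA whitening. The function names, the `temperature` value, and the toy data are all illustrative assumptions; the exact objective and hyperparameters are not given in this summary.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, sense_ids, temperature=0.1):
    """SupCon-style loss (assumed form): same-sense occurrences are positives.

    Assumes at least two examples, and at least one sense id that occurs twice.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # cosine geometry
    n = len(x)
    not_self = ~np.eye(n, dtype=bool)

    # Scaled similarities; an anchor is never compared with itself.
    sims = np.where(not_self, x @ x.T / temperature, -np.inf)

    # Row-wise log-softmax over all other examples in the batch.
    row_max = sims.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sims - row_max).sum(axis=1, keepdims=True))
    log_prob = sims - row_max - log_norm

    ids = np.asarray(sense_ids)
    positives = (ids[:, None] == ids[None, :]) & not_self
    has_pos = positives.any(axis=1)

    # Average log-probability of the positives, for anchors that have any.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / positives.sum(axis=1)[has_pos]
    return float(per_anchor.mean())


def pca_whiten(embeddings, n_components):
    """Project onto the top principal components and rescale to unit variance."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scale = s[:n_components] / np.sqrt(len(x) - 1)
    return centered @ vt[:n_components].T / scale


# Toy batch: two occurrences of each of two hypothetical senses.
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = supervised_contrastive_loss(toy_vectors, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_vectors, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the sense labels agree with the geometry of the vectors, which is the behaviour fine-tuning with such an objective would push towards.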

Abstract

While static word embeddings are blind to context, contextual word embeddings encode rather too much of it for lexical semantics tasks: vectors of same-meaning occurrences are too different (Ethayarajh, 2019). Fine-tuning pre-trained language models (PLMs) using contrastive learning has been proposed, leveraging automatically self-augmented examples (Liu et al., 2021b). In this paper, we investigate how to inject a lexicon as an alternative source of supervision, using the English Wiktionary. We also test how dimensionality reduction impacts the resulting contextual word embeddings. We evaluate our approach on the Word-In-Context (WiC) task, in the unsupervised setting (not using the training set). We achieve a new SoTA result on the original WiC test set. We also propose two new WiC test sets, on which our fine-tuning method achieves substantial improvements. We also observe improvements, although modest, for the semantic frame induction task. Although we experimented on English to allow comparison with related work, our method is adaptable to the many languages for which large Wiktionaries exist.
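As an illustration of what an unsupervised WiC decision can look like, the sketch below labels a pair of occurrences of the same word as "same sense" when the cosine similarity of their contextual vectors reaches a threshold. This is a common setup and an assumption here, not a description of the paper's exact protocol: the function names, the default threshold of 0.5, and the toy vectors are hypothetical, and the abstract does not say how a threshold would be chosen.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def predict_same_sense(vec_a, vec_b, threshold=0.5):
    """Predict True when the two occurrences are judged to share a sense."""
    return bool(cosine(vec_a, vec_b) >= threshold)


def wic_accuracy(vector_pairs, gold_labels, threshold=0.5):
    """Fraction of pairs whose thresholded prediction matches the gold label."""
    predictions = [predict_same_sense(a, b, threshold) for a, b in vector_pairs]
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)


# Toy vectors standing in for contextual embeddings of a target word.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly parallel: predicted same sense
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: predicted different senses
]
toy_gold = [True, False]
toy_accuracy = wic_accuracy(toy_pairs, toy_gold)
```

With this kind of decision rule, embeddings whose same-sense occurrences lie closer together separate the two classes more cleanly, which is why reducing contextual variation matters for the task.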

Paper Structure (17 sections, 1 equation, 5 tables)