Reconsidering Degeneration of Token Embeddings with Definitions for Encoder-based Pre-trained Language Models

Ying Zhang; Dongyuan Li; Manabu Okumura

TL;DR

This study analyzes the fine-tuning dynamics of encoder-based PLMs and demonstrates their robustness against degeneration. It then proposes DefinitionEMB, a method that uses definitions to re-construct isotropically distributed and semantics-related token embeddings for encoder-based PLMs while maintaining their original robustness during fine-tuning.

Abstract

Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out that the distribution of learned embeddings degenerates into anisotropy (i.e., non-uniform distribution), and even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes the fine-tuning dynamics of encoder-based PLMs and demonstrates their robustness against degeneration. On the basis of this analysis, we propose DefinitionEMB, a method that utilizes definitions to re-construct isotropically distributed and semantics-related token embeddings for encoder-based PLMs while maintaining original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to re-construct such embeddings for two encoder-based PLMs: RoBERTa-base and BART-large. Furthermore, the re-constructed embeddings for low-frequency tokens improve the performance of these models across various GLUE tasks and four text summarization datasets.

Paper Structure (30 sections, 2 equations, 20 figures, 33 tables)

Introduction
Related Work
Preliminaries
Token Embedding Dynamics: An Experimental Investigation
Methodology
Embedding Construction
Objective Function
Replacing Strategy in Inference
Experiments
Experimental Settings
Quantitative Evaluation
Analysis of DefinitionEMB
Ablation Study
Embedding Dynamics
Conclusion
...and 15 more sections

Figures (20)

Figure 1: Projected token embeddings of BART with and without DelDirection on the CNNDM and Y-BIGPATENT datasets. The x-axis and y-axis represent the right singular vectors associated with the largest and the second largest singular values, respectively. The appendix provides additional examples for BART and RoBERTa.
Figure 2: Case study of the token embeddings before fine-tuning on the CNNDM dataset. "Ġ" denotes whitespace. The dashed lines from "Ġeverlasting" point to its semantics-related tokens, as identified by both ChatGPT 3.5 [achiam2023gpt] and Claude 3 Haiku [anth2024claude]. The appendix lists the tokens each model identified.
Figure 3: Overview of constructing definition embeddings to replace the last $\alpha$% of pre-trained embeddings.
Figure 4: Constructed prompts. Brackets [] are placeholders for the given word and its corresponding information. Text in the same color marks matching positions in a prompt and the corresponding word information. {corruption} indicates the span of corrupted tokens. The bpe-form without space refers to the word's surface form without the symbol "Ġ" when using BART's tokenizer. The appendix lists detailed examples.
Figure 5: Projected token embeddings in BART+DefinitionEMB before and after fine-tuning. The embeddings in (a) and (c) exhibit different shapes because they use different values of $\alpha$.
...and 15 more figures
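
The projection described in the Figure 1 caption (coordinates along the right singular vectors with the two largest singular values) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `project_top2` and the random matrix `E` are hypothetical stand-ins for a real embedding matrix, and the sketch assumes no mean-centering, since the caption does not say whether the embeddings are centered first.

```python
import numpy as np


def project_top2(embeddings: np.ndarray) -> np.ndarray:
    """Project each row onto the two right singular vectors
    associated with the largest and second largest singular values."""
    # np.linalg.svd returns singular values in descending order,
    # so the first two rows of vt are the directions we want.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    return embeddings @ vt[:2].T


# Hypothetical stand-in for a (vocab_size x hidden_dim) embedding matrix.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 16))
E[:, 0] *= 5.0  # stretch one direction to mimic a non-uniform spread

coords = project_top2(E)  # shape (1000, 2): x and y values for a scatter plot
```

Plotting `coords[:, 0]` against `coords[:, 1]` gives a 2-D view of the kind the caption describes. A cloud that is strongly elongated along one axis is one informal visual sign of an anisotropic distribution.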
