Reconsidering Degeneration of Token Embeddings with Definitions for Encoder-based Pre-trained Language Models

Ying Zhang, Dongyuan Li, Manabu Okumura

TL;DR

This study first analyzes the fine-tuning dynamics of encoder-based PLMs and demonstrates their robustness against degeneration. It then proposes DefinitionEMB, a method that uses definitions to re-construct isotropically distributed and semantics-related token embeddings for encoder-based PLMs while maintaining their original robustness during fine-tuning.

Abstract

Learning token embeddings based on token co-occurrence statistics has proven effective for both pre-training and fine-tuning in natural language processing. However, recent studies have pointed out that the distribution of learned embeddings degenerates into anisotropy (i.e., non-uniform distribution), and even pre-trained language models (PLMs) suffer from a loss of semantics-related information in embeddings for low-frequency tokens. This study first analyzes the fine-tuning dynamics of encoder-based PLMs and demonstrates their robustness against degeneration. On the basis of this analysis, we propose DefinitionEMB, a method that utilizes definitions to re-construct isotropically distributed and semantics-related token embeddings for encoder-based PLMs while maintaining original robustness during fine-tuning. Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to re-construct such embeddings for two encoder-based PLMs: RoBERTa-base and BART-large. Furthermore, the re-constructed embeddings for low-frequency tokens improve the performance of these models across various GLUE datasets and four text summarization datasets.

Paper Structure (30 sections, 2 equations, 20 figures, 33 tables)

Figures (20)

  • Figure 1: Projected token embeddings of BART with and without DelDirection on the CNNDM and Y-BIGPATENT datasets. The x-axis and y-axis represent the right singular vectors associated with the largest and second-largest singular values, respectively. Appendix \ref{appendix:projected_embeddings} provides additional examples for BART and RoBERTa.
  • Figure 2: Case study of the token embeddings before fine-tuning on the CNNDM dataset. "Ġ" denotes whitespace. The dashed lines from "Ġeverlasting" point to its semantics-related tokens, as recognized by both ChatGPT 3.5 \cite{achiam2023gpt} and Claude 3 Haiku \cite{anth2024claude}. Appendix \ref{appendix:semantically_related_llm} lists their recognitions.
  • Figure 3: Overview of constructing definition embeddings to replace the last $\alpha$% of pre-trained embeddings.
  • Figure 4: Constructed prompts. Brackets [] are placeholders for the given word and its corresponding information. Text in the same color indicates the position in a prompt and the corresponding word information. {corruption} indicates the span for corrupted tokens. The bpe-form without space refers to the word's surface form without the symbol "Ġ" when using BART's tokenizer. Appendix \ref{appendix:corrupted_prompts} lists detailed examples.
  • Figure 5: Projected token embeddings in BART+DefinitionEMB before and after fine-tuning. The embeddings in (a) and (c) exhibit different shapes because of the different values of $\alpha$.
  • ...and 15 more figures
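
The projection described in the Figure 1 caption (token embeddings plotted along the right singular vectors with the largest and second-largest singular values) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `project_top2` and the random toy matrix are hypothetical, and the sketch assumes the embedding matrix is decomposed as-is, without centering, which the caption does not specify.

```python
import numpy as np


def project_top2(embeddings: np.ndarray) -> np.ndarray:
    """Project each row of `embeddings` onto the two leading right singular vectors.

    `embeddings` has shape (vocab_size, hidden_dim); the result has shape
    (vocab_size, 2) and can be used as x/y coordinates in a scatter plot.
    """
    # Thin SVD: the rows of `vt` are the right singular vectors, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    # Assumption: no mean-centering is applied before the decomposition.
    return embeddings @ vt[:2].T


# Hypothetical stand-in for a token embedding matrix (100 tokens, 16 dims).
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 16))
toy[:, 0] += 5.0  # shared offset along one axis, giving a non-uniform spread

coords = project_top2(toy)
print(coords.shape)
```

With a real model, `toy` would be replaced by the model's input embedding matrix; how that matrix is extracted depends on the library in use and is left out here.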