Probing Context Localization of Polysemous Words in Pre-trained Language Model Sub-Layers

Soniya Vijayakumar, Josef van Genabith, Simon Ostermann

TL;DR

The main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.

Abstract

In the era of high-performing Large Language Models, researchers have widely acknowledged that contextual word representations are one of the key drivers of top performance in downstream tasks. In this work, we investigate the degree of contextualization encoded in the fine-grained sub-layer representations of a Pre-trained Language Model (PLM) through empirical experiments with linear probes. Unlike previous work, we are particularly interested in identifying the strength of contextualization across PLM sub-layer representations (i.e., Self-Attention, Feed-Forward Activation and Output sub-layers). To identify the main contributions of sub-layers to contextualization, we first extract the sub-layer representations of polysemous words in minimally different sentence pairs and compare how these representations change through the forward pass of the PLM network. Second, by probing on a sense identification classification task, we try to empirically localize the strength of contextualization information encoded in these sub-layer representations. With these probing experiments, we also try to gain a better understanding of the influence of context length and context richness on the degree of contextualization. Our main conclusion is cautionary: BERT demonstrates a high degree of contextualization in the top sub-layers if the word in question is in a specific position in the sentence with a shorter context window, but this does not systematically generalize across different word positions and context sizes.
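The first step described in the abstract, comparing a polysemous word's sub-layer representations across minimally different sentence pairs, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the vectors are made-up toy values rather than BERT activations, and the helper names (`cosine_similarity`, `average_pairwise_similarity`, `toy_reps`) are hypothetical and not taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_pairwise_similarity(pairs):
    """Mean cosine similarity over (vec_a, vec_b) pairs for one sub-layer."""
    sims = [cosine_similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims)

# Toy stand-ins (assumption): each entry holds the target word's vector in
# sentence A and in sentence B of a pair, keyed by (layer, sub-layer name).
toy_reps = {
    (1, "SA"): [([1.0, 0.0], [1.0, 0.0]), ([0.0, 2.0], [0.0, 1.0])],
    (12, "output"): [([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [1.0, 0.0])],
}

# One average per (layer, sub-layer); a lower value means the two occurrences
# of the word are represented more differently at that point in the network.
avg_sim = {key: average_pairwise_similarity(pairs) for key, pairs in toy_reps.items()}
for key, value in avg_sim.items():
    print(key, round(value, 3))
```

In an actual experiment the toy vectors would be replaced by the Self-Attention, Activation and Output representations extracted from each encoder layer; how those are extracted is not specified here and is left out of the sketch.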

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: a) Extraction of contextualized word sub-layer latent representations from a BERT encoder layer: from each BERT encoder layer, the Self-Attention (SA), Feed-Forward Activation (Acts) and Output sub-layer contextualized representations are extracted. b) Example sentences in the CPWS (Contextualised Polysemy Word Sense v2) dataset and the PWC (Polysemous Word Complexity) dataset.
  • Figure 2: Pair-wise Polysemous Word Average Cosine Similarity: pair-wise average cosine similarity on the a) CPWS and b) sPWC datasets for the Self-Attention (SA), Activation (Acts) and Output (output) sub-layers.
  • Figure 3: Static-Embeddings Average Cosine Similarity: static-embedding average cosine similarity on the a) CPWS and b) sPWC datasets for the Self-Attention (SA), Activation (Acts) and Output (output) sub-layers.
  • Figure 4: Linear Sense Probes, Logistic Regression (LR) and Support Vector Machine (SVM) linear classification accuracies: a) LR and b) SVM BERT layer-wise linear sense probe accuracies on the CPWS dataset for the Self-Attention (SA), Activation (Acts) and Output (output) sub-layers.
  • Figure 5: Linear Sense Probes, Logistic Regression (LR) and Support Vector Machine (SVM) linear classification accuracies: BERT layer-wise linear sense probe accuracies on the a, b) sPWC and c, d) PWC datasets for the Self-Attention (SA), Activation (Acts) and Output (output) sub-layers.
  • ...and 1 more figures
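
The layer-wise linear sense probes summarized in Figures 4 and 5 can be illustrated with a rough sketch. This is not the authors' setup: it swaps in a hand-rolled logistic regression trained by batch gradient descent (the paper names LR and SVM probes but their configuration is not given here), it uses synthetic two-dimensional features in place of BERT sub-layer representations, and every name in it (`train_logistic_probe`, `probe_accuracy`, `make_toy_split`, the toy sub-layer labels) is hypothetical.

```python
import math
import random

def train_logistic_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe with batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples whose sense label the linear probe predicts."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if z > 0 else 0) == y)
    return correct / len(labels)

def make_toy_split(n, separation, rng):
    """Synthetic stand-in for sub-layer vectors labeled with two senses."""
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        center = separation if y == 1 else -separation
        feats.append([rng.gauss(center, 0.3), rng.gauss(center, 0.3)])
        labels.append(y)
    return feats, labels

rng = random.Random(0)
# Assumption: separation 1.0 mimics a sub-layer that encodes sense linearly,
# separation 0.0 mimics one that carries no linear sense signal.
toy_sublayers = {"toy_informative": 1.0, "toy_uninformative": 0.0}
accuracy_by_sublayer = {}
for name, separation in toy_sublayers.items():
    train_x, train_y = make_toy_split(40, separation, rng)
    test_x, test_y = make_toy_split(20, separation, rng)
    w, b = train_logistic_probe(train_x, train_y)
    accuracy_by_sublayer[name] = probe_accuracy(w, b, test_x, test_y)

for name, acc in accuracy_by_sublayer.items():
    print(name, acc)
```

Running one such probe per layer and per sub-layer, then plotting held-out accuracy against layer index, gives the kind of curve the captions describe. The probe is kept linear so that high accuracy reflects information that is linearly readable from the representation rather than capacity in the probe itself.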