Dissecting Contextual Word Embeddings: Architecture and Representation

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, Wen-tau Yih

TL;DR

The paper conducts a comprehensive empirical analysis of contextual word representations derived from bidirectional language models, comparing LSTM, Transformer, and gated CNN architectures. It demonstrates that all architectures produce high-quality contextual embeddings that outperform non-contextual word vectors across multiple NLP tasks, while revealing a depth-dependent hierarchy: morphology at the embedding layer, local syntax in lower contextual layers, and long-range semantics such as coreference in upper layers. The study applies ELMo-style pooling to combine layer representations, and provides extensive probing (POS tagging, parsing, coreference) to show how syntactic and semantic information is distributed across layers. Overall, biLMs emerge as versatile, architecture-agnostic feature extractors capable of enhancing diverse NLP tasks without task-specific supervision beyond downstream models.
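
The ELMo-style pooling mentioned above can be sketched as a softmax-weighted sum of one token's per-layer vectors, scaled by a scalar. This is a minimal illustration, not the authors' code: the function names `softmax` and `scalar_mix`, the parameter `gamma`, and the plain-list vector representation are assumptions made for this example.

```python
import math
from typing import List, Sequence


def softmax(weights: Sequence[float]) -> List[float]:
    """Normalize raw layer weights so they are positive and sum to 1."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(
    layers: Sequence[Sequence[float]],
    raw_weights: Sequence[float],
    gamma: float = 1.0,
) -> List[float]:
    """Combine one token's layer vectors into a single vector.

    layers[0] is taken to be the word embedding layer and layers[1:] the
    contextual layers (an assumption of this sketch). raw_weights holds one
    unnormalized weight per layer; gamma is a hypothetical task-level scale.
    """
    if len(layers) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layers))
        for d in range(dim)
    ]
```

With equal raw weights the mix reduces to a plain average of the layers. In a downstream model the raw weights and `gamma` would be learned with the task, so the normalized weights indicate which layers the task relies on.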

Abstract

Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphological based at the word embedding layer through local syntax based in the lower contextual layers to longer range semantics such as coreference at the upper layers. Together, these results suggest that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

Paper Structure

This paper contains 38 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Visualization of contextual similarity between all word pairs in a single sentence using the 4-layer LSTM. The left panel uses context vectors from the bottom LSTM layer while the right panel uses the top LSTM layer. Lighter yellow-colored areas have higher contextual similarity.
  • Figure 2: t-SNE visualization of 3K random chunks and 500 unlabeled spans ("NULL") from the CoNLL 2000 chunking dataset.
  • Figure 3: Various methods of probing the information stored in context vectors of deep biLMs. Each panel shows the results for all layers from a single biLM, with the first layer of contextual representations at the bottom and last layer at the top. From top to bottom, the figure shows results from the 4-layer LSTM, the Transformer and Gated CNN models. From left to right, the figure shows linear POS tagging accuracy (%; Sec. \ref{sec:linear_probes}), linear constituency parsing (F$_1$; Sec. \ref{sec:linear_probes}), and unsupervised pronominal coreference accuracy (%; Sec. \ref{sec:visualize}).
  • Figure 4: Normalized layer weights $\mathbf{s}$ for the tasks in Sec. \ref{sec:elmo_eval}. The vertical axis indexes the layer in the biLM, with layer 0 the word embedding $\mathbf{x}_k$.
  • Figure 5: Visualization of contextual similarities from the 4-layer LSTM biLM. The first layer is at top left and last layer at bottom right, with the layer indices increasing from left to right and top to bottom in the image.
  • ...and 2 more figures
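
The contextual similarity heatmaps described for Figures 1 and 5 compare every pair of words in one sentence using their context vectors from a single layer. The captions do not name the similarity measure, so the sketch below assumes cosine similarity. The function names and the list-of-lists vector format are likewise hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarity between all token vectors of one sentence.

    vectors[i] is the context vector of token i taken from one biLM layer.
    Entry [i][j] of the result is the similarity of tokens i and j, so the
    matrix is symmetric and can be drawn directly as a heatmap.
    """
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

Computing one such matrix per layer for the same sentence, as the panels of Figure 5 do, makes it possible to compare how similarity patterns change with depth.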