Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction

Benjamin Matthias Ruppik, Michael Heck, Carel van Niekerk, Renato Vukovic, Hsien-chin Lin, Shutong Feng, Marcus Zibrowius, Milica Gašić

TL;DR

Conventional sequence tagging ignores corpus-level context. This work addresses that limitation by introducing local topological measures of the latent space around contextual embeddings. It defines a neighborhood around each embedding and computes persistent homology descriptors (persistence images and Wasserstein norms) plus codensity; the resulting topological features are fused with LM vectors in a BIO tagging model. On dialogue term extraction, contextual topological features yield statistically significant improvements over an LM-only baseline and outperform a static topological baseline, in both full-data and transfer settings. The approach is reproducible and does not require access to the original feature generator. It may apply more broadly, though the current study is limited to RoBERTa base, and computational cost is a consideration for larger corpora.

Abstract

A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e., a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.
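To make the idea of datastore-relative features concrete, here is a minimal sketch in Python (NumPy only). It is an illustration under stated assumptions, not the authors' implementation: neighborhoods are taken as the n nearest datastore vectors in Euclidean distance, codensity is taken as the distance to the k-th nearest neighbor, only 0-dimensional persistent homology is computed (its death times in a Vietoris-Rips filtration equal the edge lengths of a minimum spanning tree), and the Wasserstein norm is taken as the p-Wasserstein distance to the empty diagram with an L-infinity ground metric. All function names and parameter values are hypothetical.

```python
import numpy as np

def knn_neighborhood(datastore, v, n):
    """Return the n datastore vectors closest to v and their sorted distances.
    If v itself is stored in the datastore, it appears at distance 0."""
    dists = np.linalg.norm(datastore - v, axis=1)
    idx = np.argsort(dists)[:n]
    return datastore[idx], dists[idx]

def codensity(sorted_dists, k):
    """Assumed definition: distance to the k-th nearest neighbor (1-indexed).
    Larger values indicate a sparser region of the datastore."""
    return float(sorted_dists[k - 1])

def h0_deaths(points):
    """Death times of 0-dimensional Vietoris-Rips classes, computed as the
    edge lengths of a minimum spanning tree (Prim's algorithm)."""
    m = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(m, dtype=bool)
    in_tree[0] = True
    best = d[0].copy()  # cheapest known edge from the tree to each point
    deaths = []
    for _ in range(m - 1):
        masked = np.where(in_tree, np.inf, best)
        j = int(np.argmin(masked))
        deaths.append(masked[j])
        in_tree[j] = True
        best = np.minimum(best, d[j])
    return np.sort(np.array(deaths))

def wasserstein_norm(births, deaths, p=1.0):
    """p-Wasserstein distance to the empty diagram; with an L-infinity ground
    metric each point (b, d) lies (d - b) / 2 away from the diagonal."""
    half_pers = (np.asarray(deaths, dtype=float) - np.asarray(births, dtype=float)) / 2.0
    return float(np.sum(half_pers ** p) ** (1.0 / p))

# Toy usage with random vectors standing in for contextual embeddings.
rng = np.random.default_rng(0)
datastore = rng.normal(size=(200, 8))
v = datastore[0]
nbhd, dists = knn_neighborhood(datastore, v, n=16)
deaths = h0_deaths(nbhd)
features = np.array([
    codensity(dists, k=5),
    wasserstein_norm(np.zeros_like(deaths), deaths, p=1.0),
])
```

Because these quantities depend only on stored vectors, they can be computed without access to, or fine-tuning of, the model that produced the embeddings. Higher-dimensional homology would normally be computed with a dedicated persistent homology library rather than by hand.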

Paper Structure (25 sections, 1 equation, 2 figures, 3 tables)

This paper contains 25 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Schematic illustration of the local topological feature extraction and of our topological deep learning pipeline: The blue box illustrates the extraction of neighborhoods $\mathcal{N}_n(v)$ in the contextualized embedding space, followed by the computation of each neighborhood's topological features, resulting in a contextualized persistence image vector. Note the color coding of the different occurrences of the token 'the'; contextuality leads to different language model embedding vectors and persistence images depending on whether it is part of the term 'Autry Museum of the American West' or used as a non-content word. For each token, the language model embedding (Emb) and persistence image vectors (PI) are encoded (E), combined ($\sum$), and then serve as input to our BIO-tagging transformer (green), which is trained on the token-level term labels (B-TERM (begin), I-TERM (inside), O (outside)).
  • Figure 2: Kendall's rank correlation coefficients between various local estimates and language model (LM) perplexity for the SGD test dataset. FT stands for LM fine-tuned on the MultiWOZ2.1 training split. All correlations have $p < 10^{-6}$.
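The per-token fusion described in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the authors' code: the persistence image here evaluates persistence-weighted Gaussians at pixel centers rather than integrating over pixels, the encoders (E) are assumed to be plain linear maps, and the grid resolution, bandwidth, and hidden size are hypothetical. The embedding size of 768 matches RoBERTa base.

```python
import numpy as np

def persistence_image(births, deaths, resolution=8, extent=1.0, sigma=0.1):
    """Simplified persistence image on a resolution x resolution grid over
    [0, extent]^2 in (birth, persistence) coordinates, flattened to a vector."""
    births = np.asarray(births, dtype=float)
    pers = np.asarray(deaths, dtype=float) - births
    centers = (np.arange(resolution) + 0.5) * extent / resolution
    # gx varies along the birth axis, gy along the persistence axis
    gx, gy = np.meshgrid(centers, centers, indexing="ij")
    img = np.zeros((resolution, resolution))
    for b, p in zip(births, pers):
        # each diagram point contributes a Gaussian weighted by its persistence
        img += p * np.exp(-((gx - b) ** 2 + (gy - p) ** 2) / (2 * sigma ** 2))
    return img.ravel()

def fuse(emb, pi, W_emb, W_pi):
    """Encode each input with its own linear map, then sum the two encodings."""
    return emb @ W_emb + pi @ W_pi

# Toy usage: one token with a random embedding and a two-point diagram.
rng = np.random.default_rng(0)
emb_dim, hidden = 768, 128              # hidden size is a hypothetical choice
emb = rng.normal(size=emb_dim)          # stand-in for an LM embedding (Emb)
pi = persistence_image([0.0, 0.0], [0.3, 0.6])   # stand-in for the PI vector
W_emb = rng.normal(size=(emb_dim, hidden)) * 0.01
W_pi = rng.normal(size=(pi.size, hidden)) * 0.01
token_input = fuse(emb, pi, W_emb, W_pi)  # would be fed to the BIO tagger
```

Summing the two encodings keeps the tagger's input width fixed regardless of the persistence image resolution; in a trained model the two weight matrices would be learned jointly with the tagger.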

Theorems & Definitions (1)

  • Definition 2.1