BERT's Conceptual Cartography: Mapping the Landscapes of Meaning

Nina Haket, Ryan Daniels

TL;DR

This work operationalizes Conceptual Engineering (CE) by constructing conceptual landscapes that map the pragmatic usage of words through BERT-based contextual embeddings drawn from the spoken component of the British National Corpus. It combines PCA dimensionality reduction, Gaussian Mixture Models (GMMs), and a suite of metrics (maximum explained variance (MEV), self-similarity, and intra- and inter-group similarity) with qualitative analysis to reveal word-specific, context-driven landscapes. The findings show substantial variability across words and even within a single lemma, underscoring the need for word-by-word CE strategies rather than one-size-fits-all approaches. The methodology offers a framework for quantifying lexical landscapes that can inform ethical language design and downstream NLP tasks such as bias detection and sentiment analysis.

Abstract

Conceptual Engineers want to make words better. However, they often underestimate how varied our usage of words is. In this paper, we take the first steps in exploring the contextual nuances of words by creating conceptual landscapes -- 2D surfaces representing the pragmatic usage of words -- that conceptual engineers can use to inform their projects. We use the spoken component of the British National Corpus and BERT to create contextualised word embeddings, and use Gaussian Mixture Models, a selection of metrics, and qualitative analysis to visualise and numerically represent lexical landscapes. Such an approach has not yet been used in the conceptual engineering literature and provides a detailed examination of how different words manifest in various contexts that is potentially useful to conceptual engineering projects. Our findings highlight the inherent complexity of conceptual engineering, revealing that each word exhibits a unique and intricate landscape. Conceptual Engineers cannot, therefore, use a one-size-fits-all approach when improving words -- a task that may be practically intractable at scale.

Paper Structure (32 sections, 4 equations, 9 figures)

Figures (9)

  • Figure 1: An example of how the target word brown is turned into a contextual embedding, $e$. For a target word $w$, the $C$ tokens before and after it are input to BERT. The final embedding $e$ for the target word is then the $w$-th row of the embedding matrix $X$ output from the final hidden layer. The embeddings taken from $n$ sentences are then collated into the matrix $E$, which is reduced to 2D and fitted with a GMM.
  • Figure 2: The silhouette scores (a), optimal number of principal components (b), and optimal number of clusters (c) for the GMM fitted to each word. Bold lines indicate averages, and shaded regions indicate the standard deviation.
  • Figure 3: (a) Anisotropy-corrected self-similarity (red) and maximum explained variance (blue). (b) Intra- (solid line) and inter-group (dashed line) similarity for the optimal number of principal components (red), and for 2 principal components (blue). (c) ARI for 1000 GMMs fitted to the optimal number of principal components (red), and for 2 principal components (blue). Error bars are the standard deviations.
  • Figure 4: The conceptual landscapes generated using the negative log-likelihood of the GMM predictions in 2D for (a) duty with 5 clusters, (b) theory with 3 clusters, and (c) planet with 4 clusters.
  • Figure 5: Qualitative inspection of the conceptual landscapes for (a) duty, (b) planet with 4 clusters, and (c) marriage with 9 clusters.
  • ...and 4 more figures
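
The captions for Figures 1 and 3 describe two steps that are easy to make concrete: taking the $C$ tokens on either side of a target word, and measuring how similar a word's embeddings are to one another. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, whitespace tokens stand in for BERT's subword tokens, and toy 2D vectors stand in for real BERT embeddings. Self-similarity is taken here to be the mean pairwise cosine similarity between a word's contextual embeddings, and the anisotropy correction is assumed to subtract the same quantity computed over embeddings of unrelated words; the paper's exact definitions may differ.

```python
import math
from itertools import combinations


def context_window(tokens, w, C):
    """Return up to C tokens on each side of position w, plus the
    target's index inside the returned window (hypothetical helper)."""
    start = max(0, w - C)
    end = min(len(tokens), w + C + 1)
    return tokens[start:end], w - start


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs (needs >= 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def corrected_self_similarity(word_embeddings, baseline_embeddings):
    """Self-similarity of one word minus a baseline computed from embeddings
    of unrelated words (assumed form of the anisotropy correction)."""
    return mean_pairwise_cosine(word_embeddings) - mean_pairwise_cosine(baseline_embeddings)


# Toy usage: the target word "brown" sits at position 2.
tokens = "the quick brown fox jumps over the lazy dog".split()
window, target_index = context_window(tokens, w=2, C=2)

# Hypothetical 2D embeddings of one word in three contexts, and a toy baseline.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
baseline = [[1.0, 0.0], [0.0, 1.0]]
score = corrected_self_similarity(E, baseline)
```

In a real pipeline the window would be passed to BERT and the target's row of the final hidden layer would be stored as $e$. A score near 1 would suggest the word is used in much the same way across contexts, while a low score would suggest a more varied landscape.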