The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis

Ondřej Draganov, Steven Skiena

TL;DR

This work asks whether the global shape of unlabeled, $d$-dimensional word embeddings encodes language history. It applies persistent homology to token clouds from $81$ Indo-European languages and derives multiple language-distance matrices from distances between $0$-, $1$-, and $2$-dimensional persistence diagrams, using metrics such as the Bottleneck, (Sliced) Wasserstein, Persistence Image, and Bars statistics. These distance matrices feed two phylogenetic reconstruction methods, UPGMA and Neighbor Joining, and the resulting trees are evaluated against the Ethnologue reference tree with permutation tests over six tree-distance metrics and $24$ parameter variants. The results show significant alignment for many configurations (e.g., $484$ of $864$ cases), indicating that embedding shapes carry a real signal of linguistic history and that topological data analysis can provide novel insights into language-space structure and cross-language relationships.
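
As an illustration of the tree-building step, the following is a minimal sketch of UPGMA (average-linkage clustering) on a toy distance matrix. The function names `average_distance` and `upgma`, the nested-tuple tree format, and the four placeholder leaves are assumptions made for this example; the sketch returns only the tree topology (no branch heights) and is not the authors' implementation.

```python
from itertools import combinations


def average_distance(leaves_a, leaves_b, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(dist[x][y] for x in leaves_a for y in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))


def upgma(names, dist):
    """Return a rooted tree as nested tuples (topology only, no heights).

    `dist[a][b]` is the symmetric distance between leaves a and b.
    Hypothetical helper written for this illustration.
    """
    # Map each current subtree to the list of leaves it contains.
    clusters = {name: [name] for name in names}
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(list(clusters), 2),
            key=lambda pair: average_distance(
                clusters[pair[0]], clusters[pair[1]], dist
            ),
        )
        merged_leaves = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = merged_leaves
    (tree,) = clusters
    return tree


# Toy data (placeholders, not real languages or measured distances).
toy_names = ["w", "x", "y", "z"]
toy_dist = {a: {b: 0.0 if a == b else 6.0 for b in toy_names} for a in toy_names}
toy_dist["w"]["x"] = toy_dist["x"]["w"] = 1.0
toy_dist["y"]["z"] = toy_dist["z"]["y"] = 2.0

print(upgma(toy_names, toy_dist))
```

In the pipeline described above, the input matrix would instead hold distances between the persistence diagrams of each pair of languages.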

Abstract

Word embeddings represent language vocabularies as clouds of $d$-dimensional points. We investigate how information is conveyed by the general shape of these clouds, rather than by the semantic meaning of each token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled embeddings. These distances quantify the degree of non-isometry of the embeddings. To distinguish whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong and statistically significant similarities to the reference.
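
To make the idea of a shape distance that vanishes on isometric clouds concrete, here is a minimal sketch restricted to $0$-dimensional persistence. It relies on the standard fact that, in a Vietoris-Rips filtration, the scales at which connected components merge are the edge lengths of a minimum spanning tree. The names `h0_deaths` and `sorted_death_distance` are hypothetical, scales are measured as pairwise distances, and the L1 gap between sorted death scales is a deliberately crude stand-in for the diagram distances used in the paper, not the paper's method.

```python
import math
from itertools import combinations


def h0_deaths(points):
    """Death scales of 0-dimensional features in a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. A component dies
    when an edge first merges it into another one, so the death scales are
    the edge lengths of a minimum spanning tree (Kruskal's algorithm).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    # n - 1 values in ascending order; the last component never dies.
    return deaths


def sorted_death_distance(cloud_a, cloud_b):
    """L1 gap between sorted H0 death scales (simplified stand-in metric)."""
    deaths_a, deaths_b = h0_deaths(cloud_a), h0_deaths(cloud_b)
    if len(deaths_a) != len(deaths_b):
        raise ValueError("this sketch only compares clouds of equal size")
    return sum(abs(x - y) for x, y in zip(deaths_a, deaths_b))


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# Rotate by 90 degrees and translate: an isometry of the plane.
moved = [(-y + 3.0, x - 2.0) for x, y in square]
# Stretch along one axis: not an isometry.
stretched = [(2.0 * x, y) for x, y in square]

print(sorted_death_distance(square, moved))      # isometric copy
print(sorted_death_distance(square, stretched))  # distorted copy
```

The isometric copy yields a distance of zero because all pairwise distances are preserved, while the stretched copy does not; this is the sense in which such distances measure non-isometry.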

Paper Structure (37 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Unions of disks centered at data points in a Euclidean plane define a 'shape' that varies with the radius of the disks. In our setting, each point is one token of a word embedding, and the space is $\mathbb{R}^{300}$ rather than $\mathbb{R}^2$. Persistent homology studies different types of 'gaps' within this growing shape: $0$-dimensional gaps between connected components, and $1$-dimensional holes within the shape. Each such feature is born at some radius and dies at some later radius, spanning an interval $[r_b, r_d)$. These intervals are summarised in a persistence diagram, where each interval is represented by a dot with coordinates $(r_b, r_d)$. The left and middle figures correspond to the birth and death radii of the loop in the lower-right corner: $[r_b, r_d)=[0.072, 0.119)$. Overlaid is the Vietoris-Rips complex, which discretizes the pink shape to allow for computations (see Section \ref{sec:vietoris-rips}). On the right is the persistence diagram of the growing disks, with highlighted blocks corresponding to radii $r_b$, $r_d$.
  • Figure 2: The statistical significance of TDA trees for 30, 50, and 81 languages against the Ethnologue reference, $E$, for trees reconstructed by UPGMA and NJ for each combination of parameters described in Section \ref{sec:methods}. Each dot represents a single reconstructed tree, $T$, and a tree distance, $D$. We performed 100,000 random permutations of the leaves of $T$, and compared each to the reference $E$ using the distance $D$. This yields a distribution with mean $\mu$ and standard deviation $\sigma$. To evaluate the reconstruction $T$, we plot $(\mu - D(T,E)) / \sigma$. The higher the value, the better the reconstruction. A star inside a dot signifies that $D(T,E)$ is smaller than 95,500 of the permuted tree distances.
  • Figure 3: The fraction of parameter combinations (out of 288) in Figure \ref{fig:sd-analysis-tda-tree-permutations} bested by $\leq p\cdot 100,\!000$ permutations.
  • Figure 4: The statistical significance of the labelings of the Ethnologue tree that optimize distance matrix correlation (see Section \ref{sec:preserving-ethnologue-topology}), for each combination of parameters described in Section \ref{sec:methods}. Each dot represents the distance of a single reconstructed labeling to the reference Ethnologue tree, and its position shows how many standard deviations away from the mean it lies in a distribution of 100,000 random labelings of the Ethnologue tree.
  • Figure 5: The fraction of parameter combinations (out of 288) in Figure \ref{fig:sd-analysis-tda-tree-permutations} further away from the mean than $\sigma$ standard deviations.
  • ...and 4 more figures
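
The permutation score described in the caption of Figure 2, $(\mu - D(T,E)) / \sigma$, can be sketched as follows. Everything here is an assumption made for illustration: trees are nested tuples, the tree distance is a simple Robinson-Foulds-style count of clusters found in only one tree (not necessarily one of the six distances used in the paper), the toy leaves are placeholders rather than languages, and the permutation count is far smaller than the 100,000 used in the paper.

```python
import random
import statistics


def clusters(tree):
    """Set of leaf sets, one per internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def cluster_distance(t1, t2):
    """Number of clusters present in exactly one of the two rooted trees."""
    return len(clusters(t1) ^ clusters(t2))


def relabel(tree, mapping):
    """Copy of the tree with every leaf renamed through `mapping`."""
    if not isinstance(tree, tuple):
        return mapping[tree]
    return tuple(relabel(child, mapping) for child in tree)


def permutation_score(tree, reference, n_perm=2000, seed=0):
    """Return ((mu - D(T, E)) / sigma, number of permutations beaten by T)."""
    rng = random.Random(seed)
    names = sorted(frozenset().union(*clusters(tree)))
    observed = cluster_distance(tree, reference)
    null = []
    for _ in range(n_perm):
        shuffled = names[:]
        rng.shuffle(shuffled)  # random permutation of the leaves of T
        permuted = relabel(tree, dict(zip(names, shuffled)))
        null.append(cluster_distance(permuted, reference))
    mu, sigma = statistics.mean(null), statistics.pstdev(null)
    beaten = sum(d > observed for d in null)
    return (mu - observed) / sigma, beaten


# Toy trees on placeholder leaves; they differ in one cherry only.
reference = ((("a", "b"), "c"), (("d", "e"), "f"))
reconstructed = ((("a", "b"), "c"), (("d", "f"), "e"))

score, beaten = permutation_score(reconstructed, reference)
print(cluster_distance(reconstructed, reference), score, beaten)
```

A large positive score means the reconstructed tree is much closer to the reference than its randomly relabeled copies, and `beaten` plays the role of the count behind the star marker in Figure 2.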