
A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations

Daniel Atzberger, Tim Cech, Willy Scheibel, Jürgen Döllner, Michael Behrisch, Tobias Schreck

TL;DR

This work examines how stable 2D text spatializations, derived from latent embeddings and dimensionality reductions (DRs), remain under data changes, hyperparameter changes, and randomness. It introduces a large-scale, two-stage sensitivity analysis across three corpora, six embeddings, and four DRs, using ten similarity metrics that are aggregated into local, global, and class-separation scores. The study finds that text embeddings generally enhance stability, with t-SNE paired with topic models delivering particularly robust results, and provides practical guidelines for selecting an embedding-DR combination. The results offer a reproducible benchmarking framework and actionable insights to improve the reliability and interpretability of text spatializations in practice.

Abstract

The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding, including topic models. The resulting layout depends on the input data and the hyperparameters of the dimensionality reduction and is therefore affected by changes to either. Such changes to the layout, however, require additional cognitive effort from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts concerning (1) changes in the text corpora, (2) changes in the hyperparameters, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combination of three text corpora and six text embeddings and a grid-search-inspired hyperparameter selection of the dimensionality reductions. Afterward, we quantified the similarity of the layouts through ten metrics, concerning local and global structures and class separation. Second, we analyzed the resulting 42817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlighted specific hyperparameter settings. We provide our implementation as a Git repository at https://github.com/hpicgs/Topic-Models-and-Dimensionality-Reduction-Sensitivity-Study and the results as a Zenodo archive at https://doi.org/10.5281/zenodo.12772898.
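
The idea of quantifying how similar two layouts of the same documents are can be illustrated with a small sketch. It assumes a k-nearest-neighbor Jaccard overlap as a stand-in for a local similarity measure; this measure, the function names `knn_sets` and `local_similarity`, and the choice `k=3` are illustrative assumptions and are not claimed to be among the ten metrics used in the study.

```python
import math
import random


def knn_sets(points, k):
    """Return, for each point, the index set of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance to p.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(order[:k]))
    return neighbors


def local_similarity(layout_a, layout_b, k=3):
    """Mean Jaccard overlap of k-NN sets between two layouts.

    Both layouts must list the same documents in the same order.
    A value of 1.0 means every document keeps exactly the same neighbors.
    """
    if len(layout_a) != len(layout_b):
        raise ValueError("layouts must contain the same documents")
    if not 1 <= k < len(layout_a):
        raise ValueError("k must be between 1 and the number of documents - 1")
    nn_a = knn_sets(layout_a, k)
    nn_b = knn_sets(layout_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(nn_a, nn_b)]
    return sum(scores) / len(scores)


# Hypothetical 2D layouts: random points stand in for real DR output.
rng = random.Random(0)
layout = [(rng.random(), rng.random()) for _ in range(30)]
scaled = [(2 * x, 2 * y) for x, y in layout]          # same neighborhoods
unrelated = [(rng.random(), rng.random()) for _ in range(30)]

same_score = local_similarity(layout, scaled)
unrelated_score = local_similarity(layout, unrelated)
```

In a real experiment, the two layouts would come from two runs of a dimensionality reduction that differ in exactly one factor (input jitter, one hyperparameter value, or the seed), so that a drop in the score can be attributed to that factor.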

Paper Structure (24 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Exemplary comparison of pairs of scatterplots. To analyze the stability concerning input data, we compare pairs of scatterplots that only differ in the amount of jitter applied to the DTM. To analyze the stability concerning hyperparameters, we compare pairs of scatterplots that differ in one hyperparameter setting with consecutive values. To analyze stability concerning randomness, we compare two layouts that only differ in their seeds.
  • Figure 2: Heatmap showing the pairwise correlations between the similarity metrics using a diverging color scheme. We additionally show the correlation with the Silhouette Coefficient, which is another cluster separation metric. Metrics that correlate nearly perfectly, i.e., $\alpha_T, \alpha_C, \alpha_{MM}, \alpha_{MF}$ as well as $\beta_{PC}, \beta_{SC}$, are each treated as one metric by taking their average. Note: the local and global similarity measures show a negative correlation to the class separation measures, as they have opposite optima.
  • Figure 3: Results of the first experiment to quantify the stability concerning changes to the input data. The hue of each bar indicates the DR, and the intensity indicates the amount of jitter applied to the DTM. The metrics $\tilde{\alpha}$, $\tilde{\beta}$, and $\tilde{\gamma}$ quantify how well the layout algorithm adapts to changes to the DTM, with 1 being optimal. The visualization indicates that BERT, in combination with t-SNE, best reflects changes to the DTM concerning $\tilde{\alpha}$ and $\tilde{\beta}$, resulting in improvements compared to the VSM. Note: The vertical axis ranges differ between the three metrics $\tilde{\alpha}$, $\tilde{\beta}$, and $\tilde{\gamma}$.
  • Figure 4: Results of the second experiment to quantify the stability concerning hyperparameters. The hue of each bar indicates the DR and the intensity indicates a specific hyperparameter that is varied. LDA, LSI, and NMF, in combination with t-SNE, show the highest stability concerning changes to the hyperparameters. Note: The vertical axis ranges differ between the three metrics $\alpha$, $\beta$, and $\gamma$.
  • Figure 5: Results of the third experiment to quantify the stability concerning randomness. The color of each bar indicates the DR underlying the layout algorithm. Overall, LDA in combination with t-SNE shows the best results. Note: The vertical axis ranges differ between the three metrics $\alpha$, $\beta$, and $\gamma$.
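
The aggregation described in the caption of Figure 2, where nearly perfectly correlated metrics are averaged into a single score, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `pearson` and `merge_by_average` are hypothetical, the choice of Pearson correlation is an assumption, and the toy values are made up purely for demonstration and are not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def merge_by_average(table, names):
    """Average the named metric columns row by row into one column."""
    columns = [table[name] for name in names]
    return [sum(values) / len(values) for values in zip(*columns)]


# Toy values, invented for illustration only (one row per layout pair).
toy = {
    "alpha_T": [0.90, 0.70, 0.50, 0.30],
    "alpha_C": [0.88, 0.72, 0.48, 0.31],
    "gamma": [0.10, 0.30, 0.55, 0.70],
}

r_local = pearson(toy["alpha_T"], toy["alpha_C"])   # close to +1
r_mixed = pearson(toy["alpha_T"], toy["gamma"])     # negative
alpha_merged = merge_by_average(toy, ["alpha_T", "alpha_C"])
```

Averaging only columns that correlate almost perfectly loses little information while reducing the number of scores a reader has to compare. Columns with opposite optima, such as the toy `gamma` column here, would not be merged with the others.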