Compressing and Interpreting Word Embeddings with Latent Space Regularization and Interactive Semantics Probing

Haoyu Li, Junpeng Wang, Yan Zheng, Liang Wang, Wei Zhang, Han-Wei Shen

TL;DR

This work designs a visual analytics system to monitor the regularization process, explore the HD latent space, and interpret latent dimensions’ semantics. It shows that each dimension of the regularized latent space is more semantically salient and validates the effectiveness of the embedding regularization and interpretation approach.

Abstract

Word embedding, a high-dimensional (HD) numerical representation of words generated by machine learning models, has been used for different natural language processing tasks, e.g., translation between two languages. Recently, there has been an increasing trend of transforming the HD embeddings into a latent space (e.g., via autoencoders) for further tasks, exploiting various merits the latent representations could bring. To preserve the embeddings' quality, these works often map the embeddings into an even higher-dimensional latent space, making the already complicated embeddings even less interpretable and consuming more storage space. In this work, we borrow the idea of $\beta$-VAE to regularize the HD latent space. Our regularization implicitly condenses information from the HD latent space into a much lower-dimensional space, thus compressing the embeddings. We also show that each dimension of our regularized latent space is more semantically salient, and validate our assertion by interactively probing the encoding-level of user-proposed semantics in the dimensions. To this end, we design a visual analytics system to monitor the regularization process, explore the HD latent space, and interpret latent dimensions' semantics. We validate the effectiveness of our embedding regularization and interpretation approach through both quantitative and qualitative evaluations.
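
For orientation, the $\beta$-VAE idea the abstract refers to is usually written as the loss below. This is the standard formulation from the $\beta$-VAE literature, not necessarily the exact form of the paper's own equation. The symbols are assumptions introduced here: $x$ is an input embedding, $z$ its latent vector with $d$ dimensions, $q(z \mid x)$ the encoder's diagonal Gaussian with per-dimension mean $\mu_j$ and variance $\sigma_j^2$, $p(x \mid z)$ the decoder, and $p(z)$ a standard normal prior.

$$
\mathcal{L}(x) = -\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] + \beta \, D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big),
\qquad
D_{\mathrm{KL}} = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
$$

The first term is the reconstruction error and the second is the regularizer. A larger $\beta$ penalizes every latent dimension that departs from the prior more strongly, which is one intuition for how such a penalty can push information into fewer dimensions.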

Paper Structure (33 sections, 1 equation, 13 figures, 2 tables)


Figures (13)

  • Figure 1: (a) AE encodes each instance into a latent vector, i.e., 3 scalars here, since the latent space is in 3D. (b) VAE encodes each instance into a set of Gaussian distributions (each is parameterized by a mean and a variance value). A latent vector is then sampled from them.
  • Figure 2: LNMap [mohiuddin2020lnmap] aligns two languages' embeddings (200k English and Spanish words) by (1) transferring each into a latent space through an AE and (2) aligning the two AEs' latent spaces. The input and latent spaces have 300 and 350 dimensions, respectively.
  • Figure 3: (a) Probing the encoding-level of the gender semantics (represented by man-woman) in the first latent dimension. (b) $(\theta{+}\phi){/}2$ denotes the encoding-level of gender semantics in the dimension.
  • Figure 4: Our system consists of four views: (a) the Model Evolution View presents five training dynamics reflecting the embeddings' quality and regularization scale; (b) the Dimension Exploration View employs a zoomable PCP and customized glyphs to present latent dimensions, allowing users to probe the encoding-level of individual dimensions; (c) the Projection View and (d) the Word Cloud View disclose details of the selected dimension (e.g., semantics extension, latent space density) and relate it to the words' semantics.
  • Figure 5: A glyph reflects the angle between the directions of the semantics (black) and regressed direction (red). The glyph's radius encodes the extent of the samples along the regressed direction. (a) and (b) represent useful and deprecated dimensions, respectively.
  • ...and 8 more figures
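
The VAE encoding described in the Figure 1 caption (one Gaussian per latent dimension, then a sampled latent vector) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the example means and log-variances, and the use of a standard normal prior are all hypothetical choices made here.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw one latent vector: z_j = mu_j + sigma_j * eps_j, with eps_j ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_per_dimension(mean, log_var):
    """KL(N(mu_j, sigma_j^2) || N(0, 1)) for each latent dimension j."""
    return [0.5 * (m * m + math.exp(lv) - lv - 1.0)
            for m, lv in zip(mean, log_var)]


def beta_vae_loss(recon_error, mean, log_var, beta):
    """Reconstruction error plus a beta-weighted KL regularizer."""
    return recon_error + beta * sum(kl_per_dimension(mean, log_var))


# Hypothetical encoder output for one word in a 3D latent space (as in Figure 1).
rng = random.Random(0)
mean = [0.8, 0.0, -1.5]
log_var = [-2.0, 0.0, -1.0]  # log of each dimension's variance

z = sample_latent(mean, log_var, rng)
kl = kl_per_dimension(mean, log_var)

print("sampled latent vector:", z)
print("per-dimension KL:", kl)
print("loss with beta=4:", beta_vae_loss(1.0, mean, log_var, beta=4.0))
```

In this toy input, the second dimension has zero mean and unit variance, so it matches the prior exactly and its KL term is zero: it carries no information about the word. Inspecting per-dimension KL values in this way is one common heuristic for spotting dimensions that a regularized latent space has stopped using.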