Entity Embeddings of Categorical Variables

Cheng Guo, Felix Berkhahn

TL;DR

The paper tackles learning representations for high-cardinality categorical features in structured data. It introduces entity embeddings—vectors learned as part of standard supervised training—as an alternative to one-hot encoding. The embeddings improve neural networks and other models, enhance generalization in sparse data scenarios, and enable visualization and clustering of categorical variables. Empirical results on the Rossmann Kaggle competition accompany a discussion of the embeddings' relation to finite metric spaces and their practical impact for structured-data modeling.

Abstract

We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but, more importantly, by mapping similar values close to each other in the embedding space, it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relatively simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and its statistics are unknown. Thus it is especially useful for datasets with many high-cardinality features, where other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained neural network considerably boost the performance of all tested machine learning methods when used as input features instead. As entity embedding defines a distance measure for categorical variables, it can be used for visualizing categorical data and for data clustering.
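To make the mapping concrete, the following is a minimal sketch (not the authors' code) of why an embedding lookup amounts to a linear layer applied to a one-hot vector. The sizes, the random weights, and the helper names `one_hot`, `linear_layer`, and `embed` are assumptions chosen for illustration; in practice the weight matrix would be learned during supervised training.

```python
import random


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear_layer(vec, weights):
    """Multiply a row vector (length m) by an m x d weight matrix, no bias."""
    dims = len(weights[0])
    return [
        sum(vec[i] * weights[i][j] for i in range(len(vec)))
        for j in range(dims)
    ]


def embed(index, weights):
    """Embedding lookup: select row `index` of the weight matrix."""
    return list(weights[index])


# Hypothetical sizes: 5 category values mapped into a 3-dimensional space.
rng = random.Random(0)
num_categories, embedding_dim = 5, 3
weights = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(num_categories)
]

# The lookup and the one-hot matrix product give the same vector.
for k in range(num_categories):
    assert linear_layer(one_hot(k, num_categories), weights) == embed(k, weights)
```

The lookup avoids materializing the one-hot vector and the full matrix product, which is one way to read the memory and speed advantage mentioned in the abstract.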

Paper Structure

This paper contains 15 sections, 20 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration that entity embedding layers are equivalent to extra layers on top of each one-hot encoded input.
  • Figure 2: Distance in the store embedding space versus distance in the metric space for 10000 random pairs of stores.
  • Figure 3: The learned German state embedding is mapped to a 2D space with t-SNE. The relative positions of the German states here resemble those on the real map of Germany surprisingly well.
  • Figure 4: Sales distribution along first principal component (upper left) and second principal component (upper right) of embedded store indices and along two random directions (lower left and right). All $1115$ stores contributed to the plot.
  • Figure 5: Density distribution of embedded store indices along the first four principal components (from upper left to lower right). The red line corresponds to a Gaussian fit. The p-values of D'Agostino's $K^2$ normality test are all statistically significant, i.e. below $0.05$.
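
Several of the figures above rely on distances between embedded category values. As a rough sketch of that idea, the snippet below computes Euclidean distances between embedding vectors and finds each value's nearest neighbour. The three category labels and their 2D coordinates are made up for illustration and are not taken from the paper; `embedding_distance` and `nearest` are hypothetical helper names.

```python
import math

# Hypothetical 2D embeddings for three category values (illustrative only,
# not results from the paper).
embeddings = {
    "A": [0.0, 0.0],
    "B": [3.0, 4.0],
    "C": [0.5, 0.0],
}


def embedding_distance(a, b, table):
    """Euclidean distance between the embeddings of category values a and b."""
    return math.dist(table[a], table[b])


def nearest(value, table):
    """Return the other category value whose embedding is closest to `value`."""
    others = [v for v in table if v != value]
    return min(others, key=lambda v: embedding_distance(value, v, table))


print(embedding_distance("A", "B", embeddings))  # 5.0
print(nearest("A", embeddings))                  # C
```

The same distance function could feed a standard clustering or visualization routine, which is the use the abstract points to.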