Autoencoder-based General Purpose Representation Learning for Customer Embedding

Jan Henrik Bertrand, David B. Hoffmann, Jacopo Pio Gargano, Laurent Mombaerts, Jonathan Taws

TL;DR

This work introduces DeepCAE, a multi-layer contractive autoencoder (CAE) whose contractive regularization term is computed from the Jacobian of the full encoder, with the aim of learning robust, general-purpose embeddings for tabular entities. It pairs DeepCAE with a general entity embedding framework that accommodates multiple data modalities and downstream tasks, and evaluates it on 13 public datasets. Empirically, DeepCAE outperforms the other tested autoencoder variants in both reconstruction and downstream prediction, including a 34% improvement in reconstruction error over a stacked CAE. KernelPCA can nonetheless excel on downstream tasks despite weaker reconstruction. The study also discusses trade-offs between model size, training time, and performance, notes practical implications for scalable deployment, and suggests directions for improving loss functions and efficiency.

Abstract

Recent advances in representation learning have successfully leveraged the underlying domain-specific structure of data across various fields. However, representing diverse and complex entities stored in tabular format within a latent space remains challenging. In this paper, we introduce DEEPCAE, a novel method for calculating the regularization term for multi-layer contractive autoencoders (CAEs). Additionally, we formalize a general-purpose entity embedding framework and use it to empirically show that DEEPCAE outperforms all other tested autoencoder variants in both reconstruction performance and downstream prediction performance. Notably, when compared to a stacked CAE across 13 datasets, DEEPCAE achieves a 34% improvement in reconstruction error.
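
As a rough illustration of what regularizing through the Jacobian of a whole multi-layer encoder can look like, the sketch below builds that Jacobian with the chain rule and returns its squared Frobenius norm. It is not the authors' implementation: the tanh activations, the pure-Python helpers (`matvec`, `matmul`, `encoder_jacobian`, `contractive_penalty`), and the toy weights are all assumptions made for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n), both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def encoder_jacobian(weights, biases, x):
    """Return (latent code, Jacobian of the code with respect to x).

    Assumes every layer is h_k = tanh(W_k h_{k-1} + b_k) (an illustrative
    choice). By the chain rule, J = D_L W_L ... D_1 W_1 with
    D_k = diag(1 - h_k^2).
    """
    h = x
    J = None
    for W, b in zip(weights, biases):
        h = [math.tanh(z + bi) for z, bi in zip(matvec(W, h), b)]
        # Jacobian of this layer alone: row i of W scaled by tanh' = 1 - h_i^2.
        layer_J = [[(1.0 - hi * hi) * w for w in row] for hi, row in zip(h, W)]
        # Accumulate the product so J always maps input changes to h changes.
        J = layer_J if J is None else matmul(layer_J, J)
    return h, J


def contractive_penalty(weights, biases, x):
    """Squared Frobenius norm of the full encoder Jacobian at x."""
    _, J = encoder_jacobian(weights, biases, x)
    return sum(v * v for row in J for v in row)


# Toy encoder (hypothetical numbers): 3 inputs -> 2 hidden units -> 1 latent unit.
toy_weights = [[[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [[1.0, -1.0]]]
toy_biases = [[0.0, 0.1], [0.05]]
toy_input = [0.2, -0.4, 0.6]
print(contractive_penalty(toy_weights, toy_biases, toy_input))
```

In training, a penalty of this kind would typically be added to the reconstruction loss with a weighting coefficient; the coefficient and the exact loss used in the paper are not given in this summary.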

Paper Structure

This paper contains 33 sections, 7 equations, 7 figures, and 22 tables.

Figures (7)

  • Figure 1: General-purpose embedding framework for multi-modal data and multiple downstream applications. Specific modalities such as text and time-series are embedded via specific methods, and then combined with other tabular data to be fed into the general embedding model. The resulting embedding is then optionally combined with labels, and used by downstream applications.
  • Figure 2: Mean Squared Error (MSE) of reconstruction across 13 datasets (see the benchmarking datasets appendix), normalized by KernelPCA as the non-linear baseline and aggregated by the geometric mean in logarithmic scale. See the StackedCAE comparison in the reconstruction comparison figure. Error bars show a 95% confidence interval. Lower is better.
  • Figure 3: Comparison of stacked CAE and DeepCAE, normalized by KernelPCA as a non-linear baseline and aggregated by the geometric mean in logarithmic scale. Error bars show a 95% confidence interval. Lower is better.
  • Figure 4: Performance on downstream regression tasks across the regression datasets (see the benchmarking datasets appendix), normalized by the performance of a predictor trained on the raw data and aggregated by the geometric mean. See the StackedCAE comparison in the downstream regression comparison figure. Lower is better.
  • Figure 5: Performance on downstream classification tasks across the classification datasets (see the benchmarking datasets appendix), normalized by the performance of a predictor trained on the raw data and aggregated by the geometric mean. See the StackedCAE comparison in the downstream classification comparison figure. Higher is better.
  • ...and 2 more figures
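
The aggregation described in the captions above (per-dataset normalization by a baseline, followed by a geometric mean) can be sketched as follows. The function name `normalized_geometric_mean` and all numbers are hypothetical and are not taken from the paper; the sketch only shows why a geometric mean is a natural way to average ratios.

```python
import math


def normalized_geometric_mean(model_scores, baseline_scores):
    """Geometric mean of per-dataset ratios model_score / baseline_score.

    Both inputs hold one strictly positive score per dataset, in the same order.
    """
    if len(model_scores) != len(baseline_scores) or not model_scores:
        raise ValueError("need one model score and one baseline score per dataset")
    # Averaging log-ratios and exponentiating gives the geometric mean.
    log_ratios = [math.log(m / b) for m, b in zip(model_scores, baseline_scores)]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Made-up MSE values on two datasets: the model halves the baseline error on
# one dataset and doubles it on the other, so the two effects cancel out.
made_up_model_mse = [0.5, 2.0]
made_up_baseline_mse = [1.0, 1.0]
print(normalized_geometric_mean(made_up_model_mse, made_up_baseline_mse))
```

With a plain arithmetic mean, the same made-up ratios (0.5 and 2.0) would average to 1.25 and suggest the model is worse than the baseline, which is one reason ratio-normalized metrics are usually aggregated geometrically.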