Autoencoder-based General Purpose Representation Learning for Customer Embedding
Jan Henrik Bertrand, David B. Hoffmann, Jacopo Pio Gargano, Laurent Mombaerts, Jonathan Taws
TL;DR
This work introduces DeepCAE, a multi-layer contractive autoencoder whose contractive regularization term is computed from the Jacobian of the full encoder, to learn robust, general-purpose embeddings for tabular entities. It pairs DeepCAE with a general entity embedding framework that accommodates multiple data modalities and downstream tasks, and evaluates it on 13 public datasets. Empirically, DeepCAE outperforms the other tested autoencoder variants in both reconstruction and downstream prediction, with a 34% improvement in reconstruction error over StackedCAE, while KernelPCA can excel on downstream tasks despite weaker reconstruction. The study discusses trade-offs between model size, training time, and performance, highlights practical implications for scalable deployment, and suggests directions for improving loss functions and efficiency.
Abstract
Recent advances in representation learning have successfully leveraged the underlying domain-specific structure of data across various fields. However, representing diverse and complex entities stored in tabular format within a latent space remains challenging. In this paper, we introduce DeepCAE, a novel method for calculating the regularization term for multi-layer contractive autoencoders (CAEs). Additionally, we formalize a general-purpose entity embedding framework and use it to empirically show that DeepCAE outperforms all other tested autoencoder variants in both reconstruction performance and downstream prediction performance. Notably, when compared to a stacked CAE across 13 datasets, DeepCAE achieves a 34% improvement in reconstruction error.
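
To make the regularization term concrete, the following is a minimal sketch and not the authors' implementation. It assumes sigmoid layers and small toy weights (all hypothetical) and uses only the Python standard library. It computes the squared Frobenius norm of the Jacobian of the whole encoder via the chain rule and, for comparison, the sum of per-layer penalties as a layer-wise (stacked) formulation is commonly described. Names such as `encoder_penalties` are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def layer(W, b, x):
    """Sigmoid layer: return activations h and the local Jacobian dh/dx."""
    h = [sigmoid(z + bi) for z, bi in zip(matvec(W, x), b)]
    # d sigmoid(z)/dz = h * (1 - h), so dh/dx = diag(h * (1 - h)) @ W
    J = [[hi * (1.0 - hi) * w for w in row] for hi, row in zip(h, W)]
    return h, J


def frob_sq(J):
    """Squared Frobenius norm of a matrix."""
    return sum(v * v for row in J for v in row)


def encoder_penalties(layers, x):
    """Return (full-encoder penalty, sum of per-layer penalties) at input x."""
    h = x
    J_full = None
    layerwise = 0.0
    for W, b in layers:
        h, J = layer(W, b, h)
        layerwise += frob_sq(J)
        # Chain rule: Jacobian of the composition is the product of local Jacobians.
        J_full = J if J_full is None else matmul(J, J_full)
    return frob_sq(J_full), layerwise


# Hypothetical toy encoder: 3 inputs -> 2 hidden units -> 1 latent unit.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
x = [0.2, -0.4, 0.6]

full, stacked = encoder_penalties(layers, x)
print("full-encoder Jacobian penalty:", full)
print("sum of per-layer penalties:   ", stacked)
```

The two values generally differ: the full Jacobian is a product of the per-layer Jacobians, so its squared norm is not the sum of their squared norms. In training, a penalty of this kind would be added to the reconstruction loss with a weighting coefficient.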
