Tabular Learning: Encoding for Entity and Context Embeddings
Fredy Reusser
TL;DR
This work investigates how different encoding techniques for categorical features influence entity and context embeddings in tabular learning. By discretizing continuous features with a decision-tree-based method and evaluating both an entity embedding model and a Transformer-based context model across 10 UCI datasets, it benchmarks multiple encoders beyond the common ordinal approach. The findings show that string similarity encoding often outperforms ordinal encoding, especially in multi-label tasks, albeit with higher training cost for high-cardinality features; the context model also benefits from alternative encodings. These results inform preprocessing choices for neural/tabular architectures and point to directions for future exploration of encoder effects on embeddings and model behavior.
Abstract
This work examines the effect of different encoding techniques on entity and context embeddings, with the goal of challenging the commonly used ordinal encoding for tabular learning. Applying different preprocessing methods and network architectures to several datasets yields a benchmark of how the encoders influence the learning outcome of the networks. With the training, validation, and test data kept consistent, the results show that ordinal encoding is not the best-suited encoder for preprocessing categorical data and subsequently classifying the target variable correctly. A better outcome was achieved by encoding the features based on string similarities, computing a similarity matrix that serves as input to the network. This holds for both entity and context embeddings; the transformer architecture showed improved performance for ordinal and similarity encoding on multi-label classification tasks.
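To make the contrast between the two encoders concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a character 3-gram Jaccard overlap as the string similarity measure, and the function names and the toy `values` column are hypothetical. Ordinal encoding maps each category to a single integer, while similarity encoding maps each value to a row of similarities against every known category.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer index (alphabetical order)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values], categories


def ngrams(s, n=3):
    """Set of character n-grams of a space-padded, lower-cased string."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams (an assumed measure), in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, n=3):
    """Encode each value as its similarities to all distinct categories."""
    categories = sorted(set(values))
    matrix = [[ngram_similarity(v, c, n) for c in categories] for v in values]
    return matrix, categories


# Hypothetical categorical column, for illustration only.
values = ["bachelors", "masters", "doctorate", "bachelors"]

codes, categories = ordinal_encode(values)   # one integer per row
matrix, _ = similarity_encode(values)        # one row of len(categories) floats per row
```

The similarity matrix has one column per distinct category, so its width grows with cardinality. This is consistent with the higher training cost noted for high-cardinality features.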
