Comparative Study on the Performance of Categorical Variable Encoders in Classification and Regression Tasks

Wenbin Zhu, Runwen Qiu, Ying Fu

TL;DR

This work studies how to encode categorical variables for classification and regression. It groups models into ATI models (those that implicitly apply affine transformations to their inputs), tree-based models, and others, and proves that the one-hot encoder is a universal encoder for ATI models, while target encoders excel with tree-based models. The theoretical result shows that, for a categorical variable with $c$ levels, there exists a linear mapping $h(x)=W_{OH}\phi_{OH}(x)$ that reproduces any encoder’s contribution $f(x)=W_{\phi}\phi(x)$ on ATI architectures. Empirically, the study evaluates 14 encoders across 28 datasets with 8 models. It finds that data sufficiency, quantified by ASPL/minASPL, governs when one-hot or target encoders perform best, and that time cost also matters. The findings translate into practical guidance for encoder selection in applications such as fraud detection and disease diagnosis, bridging theory and large-scale experiments.

Abstract

Categorical variables often appear in datasets for classification and regression tasks, and they need to be encoded into numerical values before training. Since many encoders have been developed and the choice can significantly impact performance, selecting an appropriate encoder for a task is a time-consuming yet important practical issue. This study broadly classifies machine learning models into three categories: 1) ATI models, which implicitly perform affine transformations on inputs, such as multi-layer perceptron neural networks; 2) tree-based models, which are built on decision trees, such as random forest; and 3) the rest, such as kNN. Theoretically, we prove that the one-hot encoder is the best choice for ATI models in the sense that it can mimic any other encoder by learning suitable weights from the data. We also explain why the target encoder and its variants are the most suitable encoders for tree-based models. We conducted comprehensive computational experiments to evaluate 14 encoders, including one-hot and target encoders, with eight common machine-learning models on 28 datasets. The computational results agree with our theoretical analysis. The findings shed light on how data scientists in fields such as fraud detection and disease diagnosis can select a suitable encoder.

Paper Structure

This paper contains 25 sections, 1 theorem, 17 equations, 10 figures, 4 tables.

Key Result

Theorem 1

For a categorical variable $\mathcal{V}=\{v_1,v_2,\dots,v_c\}$ and any encoder $\phi: \mathcal{V} \rightarrow \mathbb{R}^l$ that encodes each level $x \in \mathcal{V}$ into an $l$-dimensional vector $\phi(x)$, and for an arbitrary linear transformation of $\phi(x)$, $f(x) = \boldsymbol{W}_\phi\phi(x)$, there exists ...
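
As a rough numerical illustration of the claim summarized in the TL;DR (a linear map on the one-hot code, $h(x)=W_{OH}\phi_{OH}(x)$, can reproduce $f(x)=W_\phi\phi(x)$), the sketch below stacks the codes $\phi(v_1),\dots,\phi(v_c)$ as the columns of a matrix and sets $W_{OH}$ to $W_\phi$ times that matrix. The sizes, the random encoder, and the helper name `one_hot_weights` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def one_hot_weights(W_phi, Phi):
    """Build one-hot weights that reproduce W_phi @ phi(v_i) for every level.

    Phi has shape (l, c): column i holds the code phi(v_i).
    The result has shape (m, c): column i is the contribution of level v_i.
    """
    return W_phi @ Phi


# Hypothetical sizes: c levels, l-dimensional encoder, m hidden units.
c, l, m = 5, 2, 3
rng = np.random.default_rng(0)

Phi = rng.normal(size=(l, c))    # an arbitrary encoder, one column per level
W_phi = rng.normal(size=(m, l))  # an arbitrary linear map applied to phi(x)

W_oh = one_hot_weights(W_phi, Phi)
one_hot = np.eye(c)              # column i is the one-hot code of level v_i

f = W_phi @ Phi                  # contributions under the generic encoder
h = W_oh @ one_hot               # contributions under the one-hot encoder

print(np.allclose(f, h))
```

Because the one-hot code of level $v_i$ selects column $i$ of $W_{OH}$, the two contributions agree for every level. The sketch only shows that suitable weights exist; whether training actually finds them depends on the data.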

Figures (10)

  • Figure 1: (a) A multi-layer neural network. A categorical variable, denoted as $x_1$ with a cardinality of $c$, is encoded by a generic encoder $\phi(x_1) = [\phi_1, \cdots, \phi_{l}]^T$, and the contribution of $x_1$ to the input of the first hidden layer is given by the blue sub-network $\boldsymbol{z}_\phi = \boldsymbol{W}_\phi \phi(x_1)$. (b) The encoder is replaced by the one-hot encoder $\phi_\text{OH}(x_1) = [\phi'_1, \cdots, \phi'_{c}]^T$, and the contribution of $x_1$ to the input of the first hidden layer is given by the orange sub-network.
  • Figure 2: Model performance using the best encoder and the one-hot encoder on datasets generated from Equation \ref{equ:syn_data1}. The x-axis represents the ASPL value. The left y-axis displays the performance (i.e., Mean Squared Error) using line plots with markers. The right y-axis shows the performance difference between the two encoders using a bar plot. The shaded region corresponds to the 95% confidence interval. (a) Single-Layer NN; (b) Linear Regression.
  • Figure 3: The average contribution difference between the one-hot encoder and the best encoder. The shaded region corresponds to the 95% confidence interval. (a) Single-Layer NN; (b) Linear Regression.
  • Figure 4: (a) Test set with 1000 samples generated using Equation \ref{equ:syn_data2}; (b) Training set consisting of 40 samples with an ASPL value of 10. The numerical values above each level are obtained by the Mean Encoder; (c) Prediction on the test set by a Random Forest model trained with the aforementioned training set and the Mean Encoder, yielding an overall accuracy of 0.730.
  • Figure 5: Performance of Random Forest using the best encoder and the Mean Encoder on datasets generated from Equation \ref{equ:syn_data2}. The x-axis represents the ASPL value. The left y-axis displays the performance (i.e., Accuracy) using line plots with markers. The right y-axis shows the performance difference between the two encoders using a bar plot. The shaded region corresponds to the 95% confidence interval.
  • ...and 5 more figures
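
Figures 4 and 5 refer to a Mean Encoder. As a minimal sketch, assuming it follows the usual meaning of mean (target) encoding, each level is replaced by the average training target observed for that level. The function names, the toy data, and the global-mean fallback for unseen levels are illustrative assumptions and are not confirmed by this excerpt.

```python
from collections import defaultdict


def fit_mean_encoder(levels, targets):
    """Map each level to the mean of the training targets seen for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    return {level: sums[level] / counts[level] for level in sums}


def transform(levels, mapping, default):
    """Encode levels with the fitted mapping; unseen levels get `default`."""
    return [mapping.get(level, default) for level in levels]


# Toy training data (hypothetical): one categorical column, binary target.
train_levels = ["a", "a", "b", "b", "b", "c"]
train_targets = [1, 0, 1, 1, 1, 0]

mapping = fit_mean_encoder(train_levels, train_targets)
global_mean = sum(train_targets) / len(train_targets)

# Level "d" never appears in training, so it falls back to the global mean.
encoded = transform(["a", "b", "c", "d"], mapping, global_mean)
print(mapping)
print(encoded)
```

The encoder yields a single numeric column whose values order the levels by their average target. With few samples per level these averages are noisy, which is one way to read the role of ASPL in the figures above.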

Theorems & Definitions (2)

  • Theorem 1
  • Definition 1