Naive Bayes Classifiers and One-hot Encoding of Categorical Variables

Christopher K. I. Williams

TL;DR

This work analyzes what happens when a $K$-valued categorical feature is one-hot encoded and passed to a Naïve Bayes classifier that treats the encoded bits as independent Bernoulli variables. The result is a product-of-Bernoullis (PoB) model rather than the correct categorical Naïve Bayes model. The paper derives the key quantity $Q^{-j}$, bounds its behavior on the simplex, and shows that PoB can overcount evidence relative to the categorical model. In simulations with probability vectors drawn from a Dirichlet distribution, PoB posteriors are typically larger, while MAP decisions largely agree with those of the categorical model, with disagreements diminishing as the number of categories grows. The findings argue for careful encoding and metadata practices, especially when handling multiple categorical inputs, since the $Q^{-j}$ factors can amplify or cancel evidence depending on priors and feature sparsity.

Abstract

This paper investigates the consequences of encoding a $K$-valued categorical variable incorrectly as $K$ bits via one-hot encoding, when using a Naïve Bayes classifier. This gives rise to a product-of-Bernoullis (PoB) assumption, rather than the correct categorical Naïve Bayes classifier. The differences between the two classifiers are analysed mathematically and experimentally. In our experiments using probability vectors drawn from a Dirichlet distribution, the two classifiers are found to agree on the maximum a posteriori class label for most cases, although the posterior probabilities are usually greater for the PoB case.
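
To make the distinction concrete, the following is a minimal sketch, not the paper's code. It assumes that the PoB model uses the class-conditional category probabilities $\theta_k$ as its Bernoulli parameters, so that the one-hot vector for category $j$ receives likelihood $\theta_j \prod_{k \neq j} (1 - \theta_k)$; the product over $k \neq j$ is presumably the paper's $Q^{-j}$ factor. The two-class probability table, the equal priors, and all function names are hypothetical, chosen only for illustration.

```python
import numpy as np


def categorical_likelihood(theta, j):
    """p(x = j | class) under the categorical Naive Bayes model."""
    return theta[j]


def pob_likelihood(theta, j):
    """Likelihood of the one-hot vector e_j when every bit is treated as an
    independent Bernoulli with parameter theta[k] (assumed PoB form)."""
    others = np.delete(theta, j)           # theta_k for all k != j
    return theta[j] * np.prod(1.0 - others)


def posterior(likelihoods, priors):
    """Normalize prior * likelihood over classes."""
    unnorm = likelihoods * priors
    return unnorm / unnorm.sum()


# Hypothetical example: 2 classes, K = 3 categories, equal priors.
thetas = np.array([[0.7, 0.2, 0.1],        # class 0
                   [0.2, 0.5, 0.3]])       # class 1
priors = np.array([0.5, 0.5])
j = 0                                      # observed category

cat_post = posterior(
    np.array([categorical_likelihood(t, j) for t in thetas]), priors)
pob_post = posterior(
    np.array([pob_likelihood(t, j) for t in thetas]), priors)

print("categorical posterior:", cat_post)
print("PoB posterior:        ", pob_post)
```

In this hand-picked example both models choose class 0, but the PoB posterior for that class is about 0.88 against about 0.78 for the categorical model, which is the kind of inflated posterior the abstract describes.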

Paper Structure (7 sections, 13 equations, 3 figures)

Analysis of $Q^{-j}$
The effect of the $Q^{-j}$ factors on the Naïve Bayes classifier
Experiments
Consequences of the $f_j(\theta_c)/f_{j}(\theta_d)$ transformation for the winning class
Comparing posterior probabilities under the categorical and PoB models
Comparing the MAP class assignment under the categorical and PoB models
Discussion

Figures (3)

Figure 1: (a) Plot of $Q^{-4}(\theta_1,\theta_2,\theta_3)$ on the simplex, for $\theta_4 =0$. (b) Upper and lower bounds on $f_j(\boldsymbol{\theta})$ against $\theta_j$ for $K=6$.
Figure 2: (a) A plot of $\log (f_j(\boldsymbol{\theta}_c)/f_j(\boldsymbol{\theta}_d))$ against $\log(\theta_{jc}/\theta_{jd})$ for $K=3$ and $\alpha=1$. (b) The same, but for $K=3$ and $\alpha=1/3$.
Figure 3: Plot showing the maximum posterior probability for the PoB model against the maximum posterior probability for the categorical model, for $K=3$ and $\alpha=1$. Datapoints in blue show occasions where the MAP class assignment is the same under both models, while the red points mark where they disagree.
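
The comparison summarized in Figure 3 can be approximated with a short Monte Carlo sketch. The setup below is an assumption rather than the paper's protocol: two classes with equal priors, each class's probability vector drawn independently from a symmetric Dirichlet with $K=3$ and $\alpha=1$, and a single observed category. The PoB likelihood is again taken to be $\theta_j \prod_{k \neq j} (1 - \theta_k)$, and the function name `simulate` and its parameters are hypothetical.

```python
import numpy as np


def simulate(K=3, alpha=1.0, n_classes=2, n_trials=10_000, seed=0):
    """Compare categorical and PoB posteriors on Dirichlet-drawn parameters.

    Assumptions (not taken from the paper): equal class priors and one
    observed category per trial.
    """
    rng = np.random.default_rng(seed)
    # thetas[t, c, :] is the category distribution of class c in trial t.
    thetas = rng.dirichlet(alpha * np.ones(K), size=(n_trials, n_classes))

    # Observed category is index 0; the symmetric Dirichlet makes the
    # choice of index irrelevant.
    cat_lik = thetas[:, :, 0]
    # PoB likelihood: theta_0 * prod_{k != 0} (1 - theta_k).
    pob_lik = cat_lik * np.prod(1.0 - thetas[:, :, 1:], axis=2)

    # Equal priors cancel, so posteriors are normalized likelihoods.
    cat_post = cat_lik / cat_lik.sum(axis=1, keepdims=True)
    pob_post = pob_lik / pob_lik.sum(axis=1, keepdims=True)

    agree = cat_post.argmax(axis=1) == pob_post.argmax(axis=1)
    return cat_post.max(axis=1), pob_post.max(axis=1), agree


cat_max, pob_max, agree = simulate()
print("MAP agreement rate:", agree.mean())
print("fraction with larger PoB max posterior:", (pob_max > cat_max).mean())
```

Plotting `pob_max` against `cat_max`, colored by `agree`, gives a scatter in the spirit of Figure 3.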
