Naive Bayes Classifiers and One-hot Encoding of Categorical Variables
Christopher K. I. Williams
TL;DR
This work analyzes the impact of one-hot encoding a $K$-valued categorical feature on Naïve Bayes classifiers, showing that treating the encoded bits as independent Bernoulli variables yields a product-of-Bernoullis (PoB) model rather than the correct categorical NB. The paper derives the key quantity $Q^{-j}$, bounds its behavior on the simplex, and proves that PoB can overcount evidence compared to the true categorical model. Through Dirichlet-based simulations, it shows that PoB posteriors are typically larger and that MAP decisions largely align with the categorical model, with disagreements diminishing as the number of categories grows. The findings stress careful encoding and metadata practices to avoid misapplication, especially when handling multiple categorical inputs, since the $Q^{-j}$ factors can amplify or cancel evidence depending on priors and feature sparsity.
Abstract
This paper investigates the consequences of encoding a $K$-valued categorical variable incorrectly as $K$ bits via one-hot encoding, when using a Naïve Bayes classifier. This gives rise to a product-of-Bernoullis (PoB) assumption, rather than the correct categorical Naïve Bayes classifier. The differences between the two classifiers are analysed mathematically and experimentally. In our experiments using probability vectors drawn from a Dirichlet distribution, the two classifiers are found to agree on the maximum a posteriori class label for most cases, although the posterior probabilities are usually greater for the PoB case.
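The comparison described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's experimental code: it assumes two equiprobable classes, a symmetric Dirichlet with concentration `alpha`, and that under PoB bit $k$ has success probability $\theta_{ck}$ for class $c$, so the PoB likelihood of observing value $j$ picks up an extra factor $\prod_{k \neq j} (1 - \theta_{ck})$. The function names and defaults are hypothetical.

```python
import numpy as np


def categorical_posterior(prior, theta, j):
    """Class posterior under categorical NB when the feature takes value j.

    prior: shape (C,), class priors; theta: shape (C, K), each row sums to 1.
    """
    unnorm = prior * theta[:, j]
    return unnorm / unnorm.sum()


def pob_posterior(prior, theta, j):
    """Class posterior when the K one-hot bits are treated as independent Bernoullis.

    Assumption: bit k has success probability theta[c, k] under class c, so the
    likelihood of the one-hot vector for value j is
    theta[c, j] * prod_{k != j} (1 - theta[c, k]).
    """
    off_bits = np.delete(1.0 - theta, j, axis=1)  # probabilities of the "off" bits, k != j
    extra = off_bits.prod(axis=1)                 # factor absent from the categorical model
    unnorm = prior * theta[:, j] * extra
    return unnorm / unnorm.sum()


def simulate(K, n_trials=2000, alpha=1.0, seed=0):
    """Return (MAP agreement rate, rate at which PoB's top posterior >= categorical's)."""
    rng = np.random.default_rng(seed)
    prior = np.array([0.5, 0.5])                  # illustrative: two equiprobable classes
    agree = sharper = 0
    for _ in range(n_trials):
        theta = rng.dirichlet(alpha * np.ones(K), size=2)  # one probability vector per class
        j = int(rng.integers(K))                  # observed category, chosen uniformly
        p_cat = categorical_posterior(prior, theta, j)
        p_pob = pob_posterior(prior, theta, j)
        agree += int(p_cat.argmax() == p_pob.argmax())
        sharper += int(p_pob.max() >= p_cat.max())
    return agree / n_trials, sharper / n_trials
```

As a hand-checkable case, take $K = 2$ with class-conditional vectors $(0.8, 0.2)$ and $(0.4, 0.6)$, equal priors, and observed value 0. The categorical posterior for the first class is $0.8 / (0.8 + 0.4) = 2/3$, while the PoB posterior is $0.64 / (0.64 + 0.16) = 0.8$: the MAP label is the same, but PoB is more confident because the "off" bit is counted as additional evidence.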
