R2VF: A Two-Step Regularization Algorithm to Cluster Categories in GLMs
Yuval Ben Dror
TL;DR
R2VF clusters nominal categories in GLMs with a two-step regularization procedure: it first ranks the nominal category effects via a regularized regression, then applies variable fusion (VF) along that ranking to produce sparse, interpretable clusters. It formalizes a penalty $J_\lambda(\beta)$ that combines ordinal fusion terms $|\beta_{ij}-\beta_{i,j-1}|^\alpha$ and nominal-difference terms $|\beta_{ij}-\beta_{ik}|^\alpha$, with tuning parameter $\lambda$ and exponent $\alpha \in \{1,2\}$. On simulated and real data, the method achieves competitive log-loss with substantially fewer covariates than ordinary Lasso with VF and CatBoost, illustrating practical gains in interpretability for high-cardinality categorical features. It offers a scalable, explainable alternative for GLMs with nominal features and suggests future refinements, including Elastic Net extensions and more efficient hyperparameter strategies.
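As a rough illustration of how the two kinds of terms might be assembled, one possible form of the penalty is sketched below. The index sets $\mathcal{O}$ (ordinal features) and $\mathcal{N}$ (nominal features), the level counts $m_i$, the unit weights, and the reference-level convention $\beta_{i0}=0$ are notational assumptions made here and are not taken from the source.

$$
J_\lambda(\beta) \;=\; \lambda \left( \sum_{i \in \mathcal{O}} \sum_{j=1}^{m_i} \left|\beta_{ij}-\beta_{i,j-1}\right|^{\alpha} \;+\; \sum_{i \in \mathcal{N}} \sum_{j>k} \left|\beta_{ij}-\beta_{ik}\right|^{\alpha} \right), \qquad \alpha \in \{1,2\}.
$$

Under this reading, the ordinal part involves only $m_i$ adjacent differences per feature, whereas the nominal part involves all $\binom{m_i+1}{2}$ pairwise differences, which is why replacing nominal terms with ordinal ones reduces the cost for high-cardinality features.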
Abstract
Over recent decades, extensive research has aimed to overcome the restrictive assumptions that a Generalized Linear Model (GLM) requires in order to generate accurate and meaningful predictions. These efforts include regularizing coefficients, selecting features, and clustering ordinal categories, among other approaches. Despite these advances, efficiently clustering nominal categories in GLMs without incurring high computational costs remains a challenge. This paper introduces Ranking to Variable Fusion (R2VF), a two-step method designed to efficiently fuse nominal and ordinal categories in GLMs. By first transforming nominal features into an ordinal framework via regularized regression and then applying variable fusion, R2VF strikes a balance between model complexity and interpretability. We demonstrate the effectiveness of R2VF through comparisons with other methods, highlighting its performance in addressing overfitting and identifying an appropriate set of covariates.
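The two steps can be sketched on a toy example. The code below is a minimal illustration under simplifying assumptions, not the paper's implementation: it handles a single nominal feature, and it replaces the regularized GLM fit of step one with a closed-form ridge estimate for a Gaussian response whose intercept is fixed at the global mean. The function names `rank_categories` and `split_code`, the parameter `lam`, and the data are all hypothetical.

```python
from collections import defaultdict


def rank_categories(cats, y, lam=1.0):
    """Step 1 (sketch): order the levels by a ridge-shrunken effect.

    Assumes a Gaussian response with the intercept fixed at the global
    mean, so the ridge estimate for level c has the closed form
    sum_{i in c}(y_i - mean) / (n_c + lam).
    """
    mean = sum(y) / len(y)
    resid_sum = defaultdict(float)
    count = defaultdict(int)
    for c, yi in zip(cats, y):
        resid_sum[c] += yi - mean
        count[c] += 1
    effect = {c: resid_sum[c] / (count[c] + lam) for c in count}
    # Sort levels by estimated effect; ties are broken by level name.
    return sorted(effect, key=lambda c: (effect[c], c))


def split_code(cats, order):
    """Step 2 (sketch): split-code the levels along the ranking.

    Column j (for j = 1..K-1) is 1 when the observation's level has
    rank >= j. The coefficient of column j is then the difference
    between the effects of the levels ranked j and j-1.
    """
    rank = {c: r for r, c in enumerate(order)}
    k = len(order)
    return [[1 if rank[c] >= j else 0 for j in range(1, k)] for c in cats]


# Toy data: four nominal levels with no natural order.
cats = ["a", "b", "c", "a", "b", "c", "d", "d"]
y = [1.0, 5.0, 1.2, 0.8, 5.2, 1.0, 3.0, 3.2]

order = rank_categories(cats, y, lam=1.0)
X = split_code(cats, order)
```

With this coding, an L1 penalty on the column coefficients penalizes the differences between adjacent ranked levels, so a coefficient shrunk exactly to zero merges two neighbouring levels into one cluster. The Lasso fit itself is omitted here; any L1-penalized GLM solver could be applied to `X`.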
