R2VF: A Two-Step Regularization Algorithm to Cluster Categories in GLMs

Yuval Ben Dror

TL;DR

R2VF clusters nominal categories in GLMs with a two-step regularization: it first ranks nominal category effects via a regularized regression, then applies variable fusion to produce sparse, interpretable clusters. It formalizes a penalty $J_\lambda(\beta)$ that combines ordinal fusion terms $|\beta_{i,j}-\beta_{i,j-1}|^\alpha$ and nominal-difference terms $|\beta_{i,j}-\beta_{i,k}|^\alpha$, with tuning parameter $\lambda$ and exponent $\alpha \in \{1,2\}$. On simulated and real data, the method achieves competitive log-loss with substantially fewer covariates than ordinary Lasso with VF and CatBoost, illustrating practical gains in interpretability for high-cardinality categoricals. It offers a scalable, explainable alternative for GLMs with nominal features and suggests future refinements, including Elastic Net extensions and more efficient hyperparameter strategies.
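
To make the two kinds of penalty terms concrete, here is a minimal Python sketch of how such a penalty could be evaluated for given coefficients. The function name `fusion_penalty`, the use of a single shared $\lambda$ for both groups of terms, and the choice to sum nominal differences over all unordered pairs are assumptions made for illustration; the paper's exact weighting may differ.

```python
from itertools import combinations


def fusion_penalty(ordinal_betas, nominal_betas, lam=1.0, alpha=1):
    """Evaluate an illustrative fusion penalty (hypothetical helper).

    ordinal_betas: list of coefficient lists, one per ordinal feature,
                   ordered by category level.
    nominal_betas: list of coefficient lists, one per nominal feature.
    lam:           regularization strength (assumed shared by all terms).
    alpha:         exponent, 1 (lasso-type) or 2 (ridge-type).
    """
    if alpha not in (1, 2):
        raise ValueError("alpha must be 1 or 2")

    total = 0.0
    # Ordinal features: penalize differences between adjacent levels only.
    for coefs in ordinal_betas:
        total += sum(abs(b - a) ** alpha for a, b in zip(coefs, coefs[1:]))
    # Nominal features: penalize differences between every pair of levels.
    for coefs in nominal_betas:
        total += sum(abs(a - b) ** alpha for a, b in combinations(coefs, 2))
    return lam * total


# One ordinal feature with three levels, one nominal feature with three levels.
print(fusion_penalty([[0.0, 1.0, 1.0]], [[0.0, 2.0, 2.0]], lam=0.5, alpha=1))  # 2.5
```

Levels that share a coefficient contribute nothing to the penalty, which is why an $\alpha = 1$ penalty of this shape tends to merge categories into clusters. The all-pairs nominal sum also grows quadratically with the number of levels, which gives some intuition for why ranking the categories first can be cheaper.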

Abstract

Over recent decades, extensive research has aimed to overcome the restrictive underlying assumptions required for a Generalized Linear Model to generate accurate and meaningful predictions. These efforts include regularizing coefficients, selecting features, and clustering ordinal categories, among other approaches. Despite these advances, efficiently clustering nominal categories in GLMs without incurring high computational costs remains a challenge. This paper introduces Ranking to Variable Fusion (R2VF), a two-step method designed to efficiently fuse nominal and ordinal categories in GLMs. By first transforming nominal features into an ordinal framework via regularized regression and then applying variable fusion, R2VF strikes a balance between model complexity and interpretability. We demonstrate the effectiveness of R2VF through comparisons with other methods, highlighting its performance in addressing overfitting and identifying an appropriate set of covariates.
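
The two steps described above can be illustrated with a small, simplified Python sketch. It is not the paper's implementation: the ranking step here uses shrunken category means for a Gaussian response as a stand-in for a regularized regression, and the helper names `rank_categories` and `split_code` are hypothetical. The second helper builds the split (thermometer) coding under which each coefficient equals the difference between two adjacent ranked levels, so an L1 penalty on those coefficients acts as variable fusion.

```python
def rank_categories(cats, y, lam=1.0):
    """Order categories by a shrunken effect estimate (illustrative stand-in
    for the regularized ranking regression; assumes a Gaussian response)."""
    grand_mean = sum(y) / len(y)
    sums, counts = {}, {}
    for c, v in zip(cats, y):
        sums[c] = sums.get(c, 0.0) + (v - grand_mean)
        counts[c] = counts.get(c, 0) + 1
    # Ridge-style shrinkage: small categories are pulled towards zero.
    effects = {c: sums[c] / (counts[c] + lam) for c in sums}
    return sorted(effects, key=effects.get)


def split_code(cats, order):
    """Split-code categories along the given order.

    Column j (1-based) is 1 when the category's rank is at least j, so its
    coefficient is the jump between ranked level j-1 and ranked level j.
    """
    rank = {c: r for r, c in enumerate(order)}
    k = len(order)
    return [[1 if rank[c] >= j else 0 for j in range(1, k)] for c in cats]


cats = ["a", "b", "c", "a", "b", "c"]
y = [1.0, 5.0, 3.0, 1.2, 5.2, 3.1]

order = rank_categories(cats, y)
print(order)                                # ['a', 'c', 'b']
print(split_code(["a", "c", "b"], order))   # [[0, 0], [1, 0], [1, 1]]
```

Fitting a Lasso-penalized GLM on the split-coded columns would then set some jumps to exactly zero, merging the corresponding adjacent categories into one cluster.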

Paper Structure

This paper contains 9 sections, 7 equations, 5 figures, and 1 algorithm.

Figures (5)

  • Figure 1: The x-axis represents the categories of $city$, and the y-axis represents the coefficients: circles are the true coefficients, and squares are the lasso coefficients. The plot shows how some cities are merged with A, so their coefficients are "dragged upwards".
  • Figure 2: Overview of the R2VF Algorithm Flow
  • Figure 3: The x-axis represents the categories of $city$, ordered by their true coefficient, and the y-axis represents the coefficients: circles are the true coefficients, and squares are the coefficients given by the Step-3 model (the ranking step), after shifting them by a constant so that “a” is the reference level.
  • Figure 4: The x-axis represents the categories of $city$, ordered by their true coefficient, and the y-axis represents the coefficients: circles are the true coefficients, and squares are the coefficients given by the final model. Some categories receive the same coefficient, meaning they were merged into the same cluster.
  • Figure 5: Log-loss on test data for 5 separate train-test splits. The x-axis represents the number of covariates used in the final model; for CatBoost, this is the number of trees created (since its depth is 1). Each boxplot is positioned at the mean number of covariates used.