Transforming variables to central normality

Jakob Raymaekers, Peter J. Rousseeuw

TL;DR

This paper tackles the sensitivity of the Box-Cox and Yeo-Johnson transformation parameters to outliers by proposing a robust framework that targets central normality rather than normality of the full distribution. It introduces a robust objective for estimating $\lambda$, rectified transformations that avoid masking outliers, and a two-step reweighted maximum likelihood estimator (RewML) that downweights outliers while refining $\lambda$. In simulations and real-data examples, RewML reduces bias and MSE in contaminated settings and reveals clearer structure in downstream analyses. The approach provides a practical preprocessing tool for anomaly detection and predictive modeling, available as the function transfo in the R package cellWise.

Abstract

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. To handle such data it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. The proposed estimator compares favorably to existing techniques in an extensive simulation study and on real data.
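
For reference, the two classical transformations named in the abstract can be sketched as follows. This is a minimal NumPy illustration of the standard textbook definitions, not the authors' code; the function names box_cox and yeo_johnson are chosen here for illustration only.

```python
import numpy as np


def box_cox(x, lam):
    """Standard Box-Cox transform; defined only for strictly positive x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam


def yeo_johnson(x, lam):
    """Standard Yeo-Johnson transform; defined for any real x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative part: a Box-Cox transform applied to 1 + x.
    if lam != 0:
        out[pos] = ((1.0 + x[pos]) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])
    # Negative part: a mirrored Box-Cox transform with parameter 2 - lam.
    if lam != 2:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out
```

With $\lambda = 1$ the Yeo-Johnson transform leaves the data unchanged and the Box-Cox transform only shifts it by one, which is why $\lambda = 1$ serves as the "no transformation" reference point.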

Paper Structure

This paper contains 25 sections, 22 equations, 20 figures.

Figures (20)

  • Figure 1: The Box-Cox (left) and Yeo-Johnson (right) transformations for several parameters $\lambda$.
  • Figure 2: Normal QQ-plot of the variable MPG in the Top Gear dataset (left) and the Box-Cox transformed variable using the maximum likelihood estimate of $\lambda$ (right). The ML estimate is heavily affected by the three outliers at the top, causing it to create skewness in the central part of the transformed data.
  • Figure 3: Normal QQ-plot of the variable Weight in the Top Gear dataset (left) and the transformed variable using the ML estimate of $\lambda$ (right). The transformation does not make the five outliers at the bottom stand out.
  • Figure 4: The rectified Box-Cox (left) and Yeo-Johnson (right) transformations for a range of parameters $\lambda$. They look quite similar to the original transformations in Figure 1 but contract less on the right when $\lambda < 1$, and contract less on the left when $\lambda > 1$.
  • Figure 5: Sensitivity curves of estimators of the parameter $\lambda$ in the Yeo-Johnson (top) and Box-Cox (bottom) transformations, with sample size $n = 100$.
  • ...and 15 more figures
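
The caption of Figure 4 describes rectified transformations that contract less in one tail. One way to obtain that behavior, shown here as a sketch and not necessarily the paper's exact definition, is to follow the Box-Cox curve up to a cutoff and continue along its tangent line beyond it. The cutoffs c_lower and c_upper and the function names are hypothetical; how the paper chooses the cutoffs is not stated in this excerpt.

```python
import numpy as np


def box_cox(x, lam):
    """Standard Box-Cox transform for strictly positive x."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam


def rectified_box_cox(x, lam, c_lower, c_upper):
    """Box-Cox with a linear (tangent) continuation in one tail.

    Assumption for illustration: the curve is replaced by its tangent
    line above c_upper when lam < 1 and below c_lower when lam > 1.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = box_cox(x, lam)

    def slope(c):
        # Derivative of the Box-Cox transform at c (also valid for lam == 0).
        return c ** (lam - 1.0)

    if lam < 1:
        tail = x > c_upper  # right tail: the concave curve would pull points inward
        out[tail] = box_cox(c_upper, lam) + (x[tail] - c_upper) * slope(c_upper)
    elif lam > 1:
        tail = x < c_lower  # left tail: the convex curve would pull points inward
        out[tail] = box_cox(c_lower, lam) + (x[tail] - c_lower) * slope(c_lower)
    return out
```

Because a concave curve lies below its tangent and a convex curve lies above it, the linear continuation keeps tail points farther from the center than the plain transformation does, which matches the "contract less" description in the caption.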