Least trimmed squares regression with missing values and cellwise outliers

Jakob Raymaekers, Peter J. Rousseeuw

Abstract

Regression is the workhorse of statistics, and is often faced with real data that contain outliers. When these are casewise outliers, that is, cases that are entirely wrong or belong to a different population, the issue can be remedied by existing casewise robust regression methods. It is another matter when cellwise outliers occur, that is, suspicious individual entries in the data matrix containing the regressors and the response. We propose a new regression method that is robust to both casewise and cellwise outliers, and handles missing values as well. Its construction allows for skewed distributions. We show that it obeys the first breakdown result for cellwise robust regression. It is also the first such method that is geared to making robust out-of-sample predictions. Its performance is studied by simulation, and it is illustrated on a substantial real dataset.

Paper Structure (25 sections, 22 equations, 27 figures, 4 tables)

Figures (27)

  • Figure 1: A toy example to illustrate the basic idea of the method.
  • Figure 2: Top: average MD (on log scale) of the estimated coefficients for $n = 400$, $d = 20$, $\varepsilon = 20\%$ of cellwise outliers, and $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_{\hbox{\scriptsize ALYZ}}$ (left) or $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_{\hbox{\scriptsize A09}}$ (right), for normal predictors. Bottom: corresponding MSE, also on log scale.
  • Figure 3: Like Figure 2, but for exponential predictors.
  • Figure 4: Like Figure 2, but for lognormal predictors.
  • Figure 5: Top row: average MD (on log scale) of the estimated coefficients for different symmetrization strategies and normal predictors. The data has dimension $d = 20$, $\varepsilon = 20\%$ of cellwise outliers, and $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_{\hbox{\scriptsize ALYZ}}$ (left) or $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_{\hbox{\scriptsize A09}}$ (right). Middle row: same for exponential predictors. Bottom row: same for lognormal predictors.
  • ...and 22 more figures