Outlier detection in non-elliptical data by kernel MRCD

Joachim Schreurs, Iwein Vranckx, Mia Hubert, Johan A. K. Suykens, Peter J. Rousseeuw

TL;DR

The paper introduces Kernel MRCD (KMRCD), a robust multivariate estimator that extends the Minimum Regularized Covariance Determinant (MRCD) to non-elliptical data by performing MRCD in a kernel-induced feature space $\mathcal{F}$. It preserves the core MRCD objective via a kernel determinant $\det(\tilde{K}^H_{\mathrm{reg}})$ and provides kernelized initial estimators, a refinement step, bandwidth selection via the median heuristic, and a principled outlier cutoff based on a lognormal approximation of robust distances. In extensive simulations, KMRCD shows robustness comparable to MRCD with substantial computational speedups when $p$ is large, and it outperforms MRCD on non-elliptical data when nonlinear kernels are used. Experiments on real data (food industry and MNIST) show improved outlier detection and denoising performance. The method offers a practical, scalable approach for robust multivariate analysis in high-dimensional, non-elliptical settings, with freely available MATLAB code.
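The median heuristic mentioned above can be illustrated with a minimal sketch. This summary does not spell out the exact form used in the paper, so the version below is an assumption: the squared bandwidth is taken as the median of all pairwise squared Euclidean distances, and the RBF kernel is written as $\exp(-\|x-y\|^2/(2\sigma^2))$. The function names `median_heuristic_sigma2` and `rbf_kernel` are hypothetical, and any robust standardization of the data beforehand is omitted.

```python
from itertools import combinations
from statistics import median
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def median_heuristic_sigma2(points):
    """Assumed form of the median heuristic: the median of the squared
    distances over all pairs i < j. Not taken from the paper's code."""
    return median(squared_distance(x, y) for x, y in combinations(points, 2))


def rbf_kernel(x, y, sigma2):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma2))


# Toy data: three nearby points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
sigma2 = median_heuristic_sigma2(points)

# Kernel matrix as a list of rows; it is symmetric with ones on the diagonal.
K = [[rbf_kernel(x, y, sigma2) for y in points] for x in points]
```

Because the bandwidth is a median of pairwise distances, a minority of far-away points (such as the fourth point here) has limited influence on it, which is one reason this heuristic is a common default for RBF kernels.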

Abstract

The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed Kernel Minimum Regularized Covariance Determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.

Paper Structure

This paper contains 26 sections, 2 theorems, 44 equations, 9 figures, 8 tables, 6 algorithms.

Key Result

Theorem 1

Given an $n \times p$ dataset $X$, the sorted eigenvalues of the covariance matrix $\hat{\Sigma}_\mathcal{F}$ and those of the centered kernel matrix $\tilde{K}$ satisfy $\lambda_j(\hat{\Sigma}_\mathcal{F}) = \lambda_j(\tilde{K})/(n-1)$ for all $j=1, \ldots, m$, where $m=\mathrm{rank}(\hat{\Sigma}_\mathcal{F})$.
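
The link between the two spectra in Theorem 1 can be motivated by a standard argument, sketched here under assumptions that this summary does not state explicitly: $\tilde{\Phi}$ (a symbol introduced only for this sketch) is the $n \times d$ matrix whose rows are the centered feature vectors, the feature space is taken to be finite-dimensional for simplicity, the covariance is normalized by $n-1$, and $\tilde{K} = \tilde{\Phi}\tilde{\Phi}^\top$. This is not the paper's proof, only an outline of why such a relation holds.

```latex
\begin{align*}
\hat{\Sigma}_\mathcal{F} &= \tfrac{1}{n-1}\,\tilde{\Phi}^\top \tilde{\Phi},
  \qquad \tilde{K} = \tilde{\Phi}\,\tilde{\Phi}^\top . \\
\intertext{Let $\tilde{K} v = \lambda v$ with $\lambda > 0$ and $v \neq 0$. Then}
\hat{\Sigma}_\mathcal{F}\,(\tilde{\Phi}^\top v)
  &= \tfrac{1}{n-1}\,\tilde{\Phi}^\top (\tilde{\Phi}\tilde{\Phi}^\top v)
   = \tfrac{1}{n-1}\,\tilde{\Phi}^\top (\lambda v)
   = \tfrac{\lambda}{n-1}\,(\tilde{\Phi}^\top v), \\
\|\tilde{\Phi}^\top v\|^2 &= v^\top \tilde{K} v = \lambda \|v\|^2 > 0 .
\end{align*}
```

So $\tilde{\Phi}^\top v$ is a nonzero eigenvector of $\hat{\Sigma}_\mathcal{F}$ with eigenvalue $\lambda/(n-1)$. The same computation with the roles of $\tilde{\Phi}$ and $\tilde{\Phi}^\top$ exchanged maps eigenvectors of $\hat{\Sigma}_\mathcal{F}$ with positive eigenvalue back to eigenvectors of $\tilde{K}$, so the nonzero eigenvalues match with multiplicity up to the factor $n-1$, and both matrices have the same rank $m$.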

Figures (9)

  • Figure 1: Illustration of kernel MRCD on two datasets of which the non-outlying part is elliptical (left) and non-elliptical (right). Both datasets contain $20\%$ of outlying observations. The generated regular observations are shown in black and the outliers in red. In the panel on the left a linear kernel was used, and in the panel on the right a nonlinear kernel. The curves on the left are contours of the robust Mahalanobis distance in the original bivariate space. The contours on the right are based on the robust distance in the kernel-induced feature space.
  • Figure 2: Results of the non-kernel MCD method on the toy datasets of Figure 1. The contour lines are level curves of the MCD-based Mahalanobis distance.
  • Figure 3: Kernel MRCD results on the toy datasets of Figure 1. In the left column the linear kernel was used, and in the right column the RBF kernel. The three stages (a), (b) and (c) are explained in the text.
  • Figure 4: Illustration of the non-elliptical simulation setting with data generated from the $t$ copula, plus $20\%$ of outlying observations. In the left panel, the regular observations are shown in black and the outliers in red. The results of the MRCD estimator are in the middle panel, and those of the KMRCD estimator in the rightmost panel, each for $h = 0.75n$. In those panels the points in the $h$-subset are shown in green, and the other points with the $n(1-\varepsilon)$ lowest (kernel) Mahalanobis distance are depicted in grey. The remaining points are shown in red. The curves are contours of the robust (kernel) Mahalanobis distance.
  • Figure 5: Illustration of the non-elliptical simulation setting with data generated from the circle manifold, plus $20\%$ of outlying observations. The remainder of the description is as in Figure 4.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Definition 1
  • Theorem 1
  • Proof of Theorem 1
  • Theorem 2
  • Proof of Theorem 2