Outlier detection in non-elliptical data by kernel MRCD
Joachim Schreurs, Iwein Vranckx, Mia Hubert, Johan A. K. Suykens, Peter J. Rousseeuw
TL;DR
The paper introduces Kernel MRCD (KMRCD), a robust multivariate estimator that extends the Minimum Regularized Covariance Determinant (MRCD) to non-elliptical data by performing MRCD in a kernel-induced feature space $\mathcal{F}$. It preserves the core MRCD objective via a kernel determinant $\det(\tilde{K}^H_{\mathrm{reg}})$ and provides kernelized initial estimators, a refinement step, bandwidth selection via the median heuristic, and a principled outlier cutoff based on a lognormal approximation of robust distances. In extensive simulations, KMRCD is about as robust as MRCD while being substantially faster when $p$ is large, and with nonlinear kernels it outperforms MRCD on non-elliptical data; experiments on real data (food industry and MNIST) show improved outlier detection and denoising performance. The method offers a practical, scalable approach to robust multivariate analysis in high-dimensional, non-elliptical settings, with freely available MATLAB code.
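To make the bandwidth selection step concrete, the following is a minimal Python sketch of a median heuristic for a radial basis function (RBF) kernel, followed by centering of the kernel matrix. It is an illustration, not the authors' MATLAB code. The function names are hypothetical, and two choices are assumptions that the summary above does not specify: $\sigma^2$ is taken as the median of the pairwise squared Euclidean distances, and the kernel is $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X (n x p)."""
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.maximum(D2, 0.0)


def median_heuristic_sigma2(X):
    """Assumed form of the median heuristic: the median of the
    squared distances over all pairs i < j."""
    D2 = pairwise_sq_dists(X)
    upper = np.triu_indices(X.shape[0], k=1)
    return np.median(D2[upper])


def rbf_kernel(X, sigma2):
    """RBF kernel matrix with k(x, y) = exp(-||x - y||^2 / (2 sigma2))."""
    return np.exp(-pairwise_sq_dists(X) / (2.0 * sigma2))


def center_kernel(K):
    """Center the kernel matrix, i.e. subtract the feature-space mean
    of all n points without ever forming the feature vectors."""
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    return K - ones @ K - K @ ones + ones @ K @ ones


# Usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sigma2 = median_heuristic_sigma2(X)
K = rbf_kernel(X, sigma2)
K_centered = center_kernel(K)
```

The resulting matrices are $n \times n$ whatever the number of variables $p$, so later computations on them scale with the number of observations rather than with $p$.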
Abstract
The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are not of that form. Moreover, the computation time of MRCD increases substantially when the number of variables goes up, and nowadays datasets with many variables are common. The proposed Kernel Minimum Regularized Covariance Determinant (KMRCD) estimator addresses both issues. It is not restricted to elliptical data because it implicitly computes the MRCD estimates in a kernel-induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations. Based on the KMRCD estimates, a rule is proposed to flag outliers. The KMRCD algorithm performs well in simulations, and is illustrated on real-life data.
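As a small illustration of the kernel trick mentioned in the abstract, the sketch below computes the squared feature-space distance from every observation to the mean of a subset $H$ using only entries of the kernel matrix $K$, via the standard identity $\|\phi(x_i) - c_H\|^2 = K_{ii} - \frac{2}{h}\sum_{j \in H} K_{ij} + \frac{1}{h^2}\sum_{j,l \in H} K_{jl}$ with $h = |H|$. This is not the KMRCD robust distance, which also involves the regularized covariance; it only shows how a feature-space quantity can be obtained without forming the feature vectors. The function name and the example subset are hypothetical.

```python
import numpy as np


def sqdist_to_subset_mean(K, H):
    """Squared feature-space distance from each point to the mean of
    the points indexed by H, computed from the kernel matrix K alone."""
    H = np.asarray(H)
    h = H.size
    diag = np.diag(K)                      # K_ii
    cross = K[:, H].sum(axis=1)            # sum over j in H of K_ij
    within = K[np.ix_(H, H)].sum()         # sum over j, l in H of K_jl
    return diag - 2.0 * cross / h + within / h**2


# Sanity check with a linear kernel, where the feature space is the
# input space and the result must match ordinary Euclidean geometry.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
K_lin = X @ X.T
H = [0, 1, 2, 3, 4]
d2_kernel = sqdist_to_subset_mean(K_lin, H)
d2_direct = np.sum((X - X[H].mean(axis=0)) ** 2, axis=1)
```

With the linear kernel the two results coincide, which is a convenient way to check such kernelized formulas before switching to a nonlinear kernel.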
