DEViaN-LM: An R Package for Detecting Abnormal Values in the Gaussian Linear Model

Geoffroy Berthelot, Guillaume Saulière, Jérôme Dedecker

TL;DR

The paper addresses the detection of abnormal values, meaning values that are poorly explained by a Gaussian linear model. It relies on the maximum absolute value of the externally studentized residuals, $T_n = \max_i |\hat e_i(X)|$, whose distribution is free of the unknown parameters $\theta$ and $\sigma^2$ for a given design $M$. Because this distribution still depends on the design, the authors propose Monte-Carlo estimation of the quantiles $c_{\alpha,n}$ and of the p-values for a given $M$ within the DEViaN-LM R package. The package returns the residuals, the outlier indices, the threshold, and a binary outlier indicator, enabling automated abnormal-value detection across real datasets. The authors demonstrate applications to biological and sociological data and report favorable runtime performance relative to naïve implementations. The work provides a practical, design-aware tool for individualized outlier detection in Gaussian linear models, with implications for precision medicine and longitudinal monitoring.

Abstract

DEViaN-LM is an R package for detecting values that are poorly explained by a Gaussian linear model. The procedure is based on the maximum of the absolute values of the studentized residuals, a statistic whose distribution does not depend on the parameters of the model. This approach generalizes several procedures used to detect abnormal values during longitudinal monitoring of certain biological markers. In this article, we describe the method and show how to apply it to several real datasets.
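The procedure described above can be sketched in a few lines. The following is a minimal Python/NumPy illustration, not the DEViaN-LM API: the function names `ext_studentized` and `detect_abnormal`, the returned dictionary keys, the add-one p-value estimate, and the toy data are all assumptions made for this example. It estimates the quantile $c_{\alpha,n}$ of $T_n$ by simulating standard Gaussian noise on the given design $M$, then flags observations whose absolute residual exceeds that threshold.

```python
import numpy as np

def ext_studentized(M, x):
    """Externally studentized residuals of x regressed on the design M (n x p).

    Assumes M has full column rank and n > p + 1.
    """
    n, p = M.shape
    Q, _ = np.linalg.qr(M)                 # orthonormal basis of the column space of M
    h = np.sum(Q**2, axis=1)               # leverages (diagonal of the hat matrix)
    e = x - Q @ (Q.T @ x)                  # raw residuals
    # Leave-one-out variance estimate for each observation
    s2_loo = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_loo * (1.0 - h))

def detect_abnormal(M, x, alpha=0.05, nsimul=2000, seed=0):
    """Hypothetical helper: Monte-Carlo threshold for T_n = max_i |e_i|."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    # Simulate T_n under pure standard Gaussian noise on the same design
    sims = np.array([
        np.max(np.abs(ext_studentized(M, rng.standard_normal(n))))
        for _ in range(nsimul)
    ])
    threshold = np.quantile(sims, 1.0 - alpha)   # estimate of c_{alpha,n}
    resid = ext_studentized(M, x)
    t_n = np.max(np.abs(resid))
    # Add-one Monte-Carlo p-value (a choice made for this sketch)
    p_value = (1 + np.sum(sims >= t_n)) / (nsimul + 1)
    outlier = np.abs(resid) > threshold
    return {
        "residuals": resid,
        "outlier": outlier,
        "outlier_idx": np.flatnonzero(outlier).tolist(),
        "threshold": threshold,
        "p_value": p_value,
    }

# Toy data: a straight line with one grossly shifted observation at index 7
rng = np.random.default_rng(42)
n = 30
t = np.linspace(0.0, 1.0, n)
M = np.column_stack([np.ones(n), t])
x = 5.0 + 2.0 * t + rng.standard_normal(n)
x[7] += 10.0

result = detect_abnormal(M, x)
print(result["outlier_idx"], round(float(result["threshold"]), 2), result["p_value"])
```

Simulating pure noise is enough because the distribution of the residuals does not involve $\theta$ or $\sigma^2$; only the design $M$ has to match the data at hand.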

Paper Structure

This paper contains 12 sections, 1 theorem, 15 equations, 2 figures, 1 table.

Key Result

Proposition 1

In Model mod, the following equality holds: for any $i \in \{1, \ldots, n\}$ Consequently, the distribution of $(\hat{e}_1(X), \ldots, \hat{e}_n(X))'$ does not depend on $(\theta, \sigma^2)$.
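The consequence stated in Proposition 1 can be checked numerically. The sketch below is an illustration in Python/NumPy under assumed names (`ext_studentized` is a hypothetical helper, not part of the package): it computes externally studentized residuals from $X = M\theta + \sigma\varepsilon$ for two arbitrary choices of $(\theta, \sigma)$ and compares them with the residuals computed from the noise $\varepsilon$ alone. Since the residuals are unchanged by adding any vector in the column space of $M$ and by any positive rescaling, the three vectors should agree up to floating-point error.

```python
import numpy as np

def ext_studentized(M, x):
    """Externally studentized residuals of x regressed on the design M (n x p)."""
    n, p = M.shape
    Q, _ = np.linalg.qr(M)                 # orthonormal basis of the column space of M
    h = np.sum(Q**2, axis=1)               # leverages
    e = x - Q @ (Q.T @ x)                  # raw residuals
    s2_loo = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_loo * (1.0 - h))

rng = np.random.default_rng(1)
n = 25
M = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])
eps = rng.standard_normal(n)               # one fixed noise draw

# Residuals from the noise alone, and from two very different (theta, sigma)
r_noise = ext_studentized(M, eps)
r_a = ext_studentized(M, M @ np.array([2.0, -0.5]) + 3.0 * eps)
r_b = ext_studentized(M, M @ np.array([-40.0, 7.0]) + 0.01 * eps)

print(np.allclose(r_a, r_noise), np.allclose(r_b, r_noise))
```

This invariance is what allows the quantiles of $T_n$ to be simulated with $\theta = 0$ and $\sigma = 1$ for a given design.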

Figures (2)

  • Figure 1: Application of DEViaN-LM to biological and sociological data in 4 examples. Black dots represent the data points, red dots show abnormal values, and red lines represent the quantiles for the requested $\alpha$ level (for all panels we took $\alpha = 5\%$). The upper left panel (A) illustrates the detection of abnormal values in the serum iron samples of one elite soccer player. The upper right panel (B) shows the abnormal values of distance per week for mice, adjusted for age and body mass (weight). The lower left panel (C) reveals abnormal salaries, adjusted for age, educational level and number of children. The lower right panel (D) shows the mouse lemurs that present abnormal grip strength values when adjusting for age and weight.
  • Figure 2: Benchmark results. In the left panel (A), the median runtime is given for different sample sizes and different numbers of CPU cores. Each measure (node) is repeated 200 times. Similarly, in the right panel (B), the median runtime is given for an increasing number of simulations nsimul using $100\%$ of the WAGE dataset. Each measure is repeated 100 times.

Theorems & Definitions (3)

  • Proposition 1
  • Remark 2
  • Remark 3