Detection of Multiple Influential Observations on Model Selection

Dongliang Zhang, Masoud Asgharian, Martin A. Lindquist

Abstract

Outlying observations are frequently encountered across a wide spectrum of scientific domains, posing notable challenges to the generalizability of statistical models and the reproducibility of downstream analysis. They are identified through influence diagnostics, which aim to capture observations that unduly bias model estimation. To date, methods for identifying observations that influence the selection of a stochastically chosen submodel have been underdeveloped, especially in the high-dimensional setting where the number of predictors $p$ exceeds the sample size $n$. Recently we proposed an improved diagnostic measure to handle this setting. However, its distributional properties and approximations have not yet been explored. To address this shortcoming, we revisit the notion of exchangeability to determine the exact asymptotic distribution of our assessment measure. This foundation enables the introduction of theoretically supported parametric and nonparametric approaches for distributional approximation and for the derivation of thresholds for outlier identification. The resulting framework is further extended to logistic regression models and evaluated by comprehensive simulation studies comparing the performance of various detection methods. Finally, the framework is applied to data from a task-based fMRI study of thermal pain, with the goal of identifying outliers that distort the formulation of the statistical model that uses functional brain activity to predict physical pain ratings. Both linear and logistic models are used to demonstrate the benefits of detection and to compare the performance of different detection procedures. In particular, we identify two influential observations that were not detected in prior studies.

Paper Structure

This paper contains 13 sections, 2 theorems, 2 equations, 2 figures, 2 tables.

Key Result

Theorem 1

Let $\tau_i$ denote the GDF metric and let $\delta_i$ be the DF(LASSO) measure defined above. Then, as $p \rightarrow \infty$, both $\tau_i$ and $\delta_i$ follow finite mixtures of binomial distributions.
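
To illustrate how a limiting distribution of this form could yield a detection threshold, the following minimal sketch evaluates the probability mass function of a finite mixture of binomials and returns an upper quantile as a cutoff. The mixture weights, the numbers of trials and the success probabilities used here are hypothetical placeholders, as are the function names; the components that the theorem actually implies for $\tau_i$ and $\delta_i$ are not given in this summary.

```python
from math import comb


def binom_pmf(k, m, q):
    """P(X = k) for X ~ Binomial(m, q); zero when k lies outside 0..m."""
    if k < 0 or k > m:
        return 0.0
    return comb(m, k) * q**k * (1 - q) ** (m - k)


def mixture_pmf(k, weights, sizes, probs):
    """P(X = k) for a finite mixture sum_j w_j * Binomial(m_j, q_j)."""
    return sum(w * binom_pmf(k, m, q) for w, m, q in zip(weights, sizes, probs))


def mixture_quantile(level, weights, sizes, probs):
    """Smallest integer k whose mixture CDF is at least `level`."""
    cumulative = 0.0
    for k in range(max(sizes) + 1):
        cumulative += mixture_pmf(k, weights, sizes, probs)
        if cumulative >= level - 1e-12:  # small tolerance for rounding error
            return k
    return max(sizes)


# Hypothetical two-component mixture (all values are assumptions, not from the paper).
weights = [0.3, 0.7]
sizes = [5, 8]
probs = [0.2, 0.6]

# An observation whose measure exceeds this cutoff would be flagged at the 5% level.
threshold = mixture_quantile(0.95, weights, sizes, probs)
```

In practice the mixture parameters would have to be estimated from the data or replaced by a resampling scheme, which is where the parametric and nonparametric approximations described in the abstract would come in.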

Figures (2)

  • Figure 1: Real data analysis on pain prediction using fMRI data. Panel (A): comparison of predictive performance before and after removing detected influential points under linear and logistic regression models. Panels (B) to (D): for the revised ClusMIP applied to the LASSO, SCAD and MCP, the distribution of detected influential observations and the remaining clean observations under linear and logistic regression models mapped back to the pain ratings classified according to the six temperatures.
  • Figure 2: Real data analysis on pain prediction using fMRI data: selection of brain regions and the corresponding magnitude of the LASSO regression coefficients. Panel (A): linear regression model based on the full dataset. Panel (B): linear regression model based on the reduced dataset after removing influential points given by $\widehat{\text{I}}_{\text{linear}}(\text{ClusMIP(LASSO)})$ with Boot-I. Panel (C): logistic regression model based on the full dataset. Panel (D): logistic regression model based on the reduced dataset after removing influential points given by $\widehat{\text{I}}_{\text{logistic}}(\text{ClusMIP(LASSO)})$ with Boot-I. Here, Boot-I is the first bootstrap scheme used to derive the threshold of $\tau_{[n]}$ for diagnostic purposes, discussed in Section \ref{sec:nonparametric}.

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2