Simultaneous Feature Selection and Outlier Detection with Optimality Guarantees

Luca Insolia, Ana Kenney, Francesca Chiaromonte, Giovanni Felici

TL;DR

This work tackles high-dimensional regression contaminated by multiple mean-shift outliers using a Mean-Shift Outlier Model (MSOM). It introduces a discrete optimization framework that jointly performs sparse feature selection and outlier detection via a two-set $L_0$-constrained MIP, with proven guarantees including a breakdown point of $\frac{k_n+1}{n}$ and a robustly strong oracle property under $L_2$ loss. The methodology is augmented with practical implementation strategies (robust standardization, ensemble-based big-$M$ bounds, and SOS-1 formulations) and validated through extensive simulations and an application to the microbiome–childhood obesity relationship, where it yields sparse, interpretable, and predictive models. The results demonstrate superior performance and resilience to outliers compared to heuristic robust methods, suggesting substantial utility for high-dimensional robust inference in genomics, microbiome, and related domains.
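For orientation, a two-set $L_0$-constrained mean-shift problem of the kind described above can be sketched as follows. This is an illustration under assumed notation, not a transcription of the paper's equations: the mean-shift vector $\bm{\phi}$, the binary indicators $z^{\beta}_j$ and $z^{\phi}_i$, and the bounds $M_\beta$ and $M_\phi$ are symbols introduced here, and any additional penalty term is omitted.

```latex
\begin{aligned}
\min_{\bm{\beta},\,\bm{\phi}} \quad
  & \lVert \bm{y} - \bm{X}\bm{\beta} - \bm{\phi} \rVert_2^2 \\
\text{s.t.} \quad
  & \lVert \bm{\beta} \rVert_0 \le k_p, \qquad
    \lVert \bm{\phi} \rVert_0 \le k_n .
\end{aligned}
% A standard big-M linearization of the two cardinality constraints:
\begin{aligned}
  & -M_\beta\, z^{\beta}_j \le \beta_j \le M_\beta\, z^{\beta}_j,
    \quad z^{\beta}_j \in \{0,1\}, \quad \sum_{j=1}^{p} z^{\beta}_j \le k_p, \\
  & -M_\phi\, z^{\phi}_i \le \phi_i \le M_\phi\, z^{\phi}_i,
    \quad z^{\phi}_i \in \{0,1\}, \quad \sum_{i=1}^{n} z^{\phi}_i \le k_n .
\end{aligned}
```

In this reading, a nonzero $\phi_i$ flags observation $i$ as an outlier and a nonzero $\beta_j$ selects feature $j$. The big-$M$ bounds must be large enough not to cut off the optimum, which is presumably why the summary mentions dedicated strategies for setting them.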

Abstract

Sparse estimation methods capable of tolerating outliers have been broadly investigated in the last decade. We contribute to this research considering high-dimensional regression problems contaminated by multiple mean-shift outliers which affect both the response and the design matrix. We develop a general framework for this class of problems and propose the use of mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We characterize the theoretical properties of our approach, i.e. a necessary and sufficient condition for the robustly strong oracle property, which allows the number of features to exponentially increase with the sample size; the optimal estimation of the parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and to warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through numerical simulations and an application investigating the relationships between the human microbiome and childhood obesity.

Paper Structure

This paper contains 8 sections, 6 theorems, 14 equations, 6 tables.

Key Result

Proposition 1

For any $\lambda$, $n$, $p$, $k_n$ and $k_p$, the estimator $\hat{\bm{\beta}}$ obtained by solving the MIQP (eq:miqp) is the same as the one obtained by solving the corresponding sparse trimming problem, where $e_i$ (for $i=1,\ldots,n$) are the residuals, and $(\rho ( e_1 ))_{1:n} \leq \ldots \leq (\rho ( e_n ))_{n:n}$ are the order statistics of their $\rho(\cdot)$ transformation.
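The link between the mean-shift formulation and trimming can be checked numerically on a toy problem. The sketch below is not the paper's algorithm: it assumes squared loss $\rho(e) = e^2$, no penalty term ($\lambda = 0$), and replaces the MIP solver with brute-force enumeration, which is only feasible for very small $n$ and $p$. The function names `fit_on_subset` and `sparse_trimmed_ls`, and the simulated data, are hypothetical.

```python
import itertools
import numpy as np


def fit_on_subset(X, y, features, kept_rows):
    """Least squares using only the given columns and rows (both lists)."""
    Xs = X[np.ix_(kept_rows, features)]
    coef, *_ = np.linalg.lstsq(Xs, y[kept_rows], rcond=None)
    beta = np.zeros(X.shape[1])
    beta[features] = coef
    return beta


def sparse_trimmed_ls(X, y, k_p, k_n):
    """Exhaustively pick k_p features and trim k_n rows to minimize the
    sum of squared residuals over the rows that are kept.

    With squared loss and no penalty, using the full budgets never
    increases the objective, so subsets of size exactly k_p and k_n suffice.
    """
    n, p = X.shape
    best_loss, best = np.inf, None
    for features in itertools.combinations(range(p), k_p):
        for outliers in itertools.combinations(range(n), k_n):
            kept = [i for i in range(n) if i not in outliers]
            beta = fit_on_subset(X, y, list(features), kept)
            resid = y - X @ beta
            loss = float(np.sum(resid[kept] ** 2))
            if loss < best_loss:
                best_loss, best = loss, (features, outliers, beta)
    return best_loss, best[0], best[1], best[2]


# Simulated data (an assumption for illustration): two active features
# and two observations with a mean shift in the response.
rng = np.random.default_rng(0)
n, p, k_p, k_n = 12, 5, 2, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -3.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
y[[3, 7]] += 10.0

loss, features, outliers, beta = sparse_trimmed_ls(X, y, k_p, k_n)

# Mean-shift view: give each flagged row its own shift phi_i equal to its
# residual, so that row contributes zero to the squared-error objective.
phi = np.zeros(n)
phi[list(outliers)] = (y - X @ beta)[list(outliers)]
mean_shift_objective = float(np.sum((y - X @ beta - phi) ** 2))

print("selected features:", features)
print("flagged outliers:", outliers)
print("trimmed loss:", loss, "mean-shift objective:", mean_shift_objective)
```

The two printed objective values agree because absorbing a flagged residual into $\phi_i$ is the same as dropping that observation from the sum. The enumeration visits $\binom{p}{k_p}\binom{n}{k_n}$ candidate pairs, which is why exact approaches at realistic scale rely on a MIP solver instead.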

Theorems & Definitions (13)

  • Proposition 1: Sparse trimming
  • Proposition 2: Breakdown point
  • Definition 1: Robustly strong oracle property
  • Proposition 3: Necessary condition for SFSOD consistency
  • Proposition 4: Oracle reconstruction for MIQP
  • Proposition 5: MIQP robustly strong oracle property
  • Proposition 6: Optimal parameter estimation
  • Proof of Proposition 1
  • Proof of Proposition 2
  • Proof of Proposition 3
  • ...and 3 more