High-dimensional prediction for count response via sparse exponential weights

The Tien Mai

Abstract

Count data are prevalent in fields such as ecology, medical research, and genomics. In high-dimensional settings, where the number of features exceeds the sample size, feature selection becomes essential. While frequentist methods like the Lasso have advanced in handling high-dimensional count data, Bayesian approaches remain under-explored, with no theoretical results on prediction performance. This paper introduces a novel probabilistic machine learning framework for high-dimensional count data prediction. We propose a pseudo-Bayesian method that integrates a scaled Student prior to promote sparsity and uses an exponential weight aggregation procedure. A key contribution is a novel risk measure tailored to count data prediction, with theoretical guarantees for prediction risk obtained from PAC-Bayesian bounds. Our results include non-asymptotic oracle inequalities, demonstrating rate-optimal prediction error without prior knowledge of sparsity. We implement this approach efficiently using the Langevin Monte Carlo method. Simulations and a real data application highlight the strong performance of our method compared to the Lasso in various settings.
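
To make the computational idea in the abstract concrete, the following is a minimal sketch of an unadjusted Langevin Monte Carlo approximation to the mean of a pseudo-posterior proportional to $\exp(-\lambda\, r(\theta))\,\pi(\theta)$. Several pieces are assumptions and are not taken from the text above: the empirical loss $r$ is a stand-in (the mean Poisson negative log-likelihood with a log link, not the paper's risk measure); the prior is taken to be $\pi(\theta) \propto \prod_i (\varsigma^2 + \theta_i^2)^{-2}$, a form of scaled Student prior common in the sparse exponential-weights literature; the $\ell_1$-ball truncation of the prior is ignored; and the function names, step size, and iteration counts are hypothetical.

```python
import numpy as np


def grad_log_pseudo_posterior(theta, X, y, lam, varsigma):
    """Gradient of -lam * r(theta) + log pi(theta) (up to constants).

    ASSUMPTION: r is the mean Poisson negative log-likelihood with log link,
    used only as a stand-in for the paper's loss.
    ASSUMPTION: pi(theta) is proportional to prod_i (varsigma^2 + theta_i^2)^(-2).
    """
    # Clip the linear predictor so exp() cannot overflow (numerical safeguard).
    eta = np.clip(X @ theta, -20.0, 20.0)
    grad_loss = X.T @ (np.exp(eta) - y) / len(y)
    grad_log_prior = -4.0 * theta / (varsigma**2 + theta**2)
    return -lam * grad_loss + grad_log_prior


def langevin_mean(X, y, lam, varsigma, step=1e-4, n_iter=2000, burn_in=500, seed=0):
    """Average of unadjusted Langevin iterates after burn-in.

    The average approximates the pseudo-posterior mean; step, n_iter and
    burn_in are hypothetical defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    running_sum = np.zeros_like(theta)
    for t in range(n_iter):
        noise = rng.standard_normal(theta.shape)
        drift = grad_log_pseudo_posterior(theta, X, y, lam, varsigma)
        # Langevin update: gradient step plus Gaussian noise of variance 2 * step.
        theta = theta + step * drift + np.sqrt(2.0 * step) * noise
        if t >= burn_in:
            running_sum += theta
    return running_sum / (n_iter - burn_in)


if __name__ == "__main__":
    # Synthetic example with more features than samples (d > n).
    rng = np.random.default_rng(1)
    n, d = 40, 60
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    theta_true = np.zeros(d)
    theta_true[:3] = 1.0
    y = rng.poisson(np.exp(X @ theta_true))
    # lam = sqrt(n) mirrors the choice in Theorem 3.1; varsigma is a placeholder.
    theta_hat = langevin_mean(X, y, lam=np.sqrt(n), varsigma=0.1)
    print(theta_hat[:5])
```

The iterate average is used because a single Langevin draw is noisy; the step size trades discretization bias against mixing speed and would need tuning in practice.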

Paper Structure

This paper contains 21 sections, 13 theorems, 94 equations, 4 tables.

Key Result

Theorem 3.1

Assume that Assumptions assume_bounded_loss, assume_X_bounded, and assume_Lipschitz are satisfied. Take $\lambda = \sqrt{n}$ and $\varsigma = (C_L C_{\rm x} n\sqrt{d})^{-1}$. Then, for all $\theta^*$ such that $\| \theta^*\|_1 \leq C_1 - 2d\varsigma$, we have that and, with probability at least $1-\varepsilon$, $\varepsilon \in (0,1)$, that for some constants $\mathcal{C}_1, \mathcal{C}'_1 > 0$ depending on

Theorems & Definitions (23)

  • Theorem 3.1
  • Corollary 3.1
  • Proposition 3.1
  • Theorem 3.2
  • Remark 1
  • Corollary 3.2
  • Proposition 3.2
  • Remark 2
  • Theorem 3.3
  • Proof of Theorem \ref{thm_main_2}
  • ...and 13 more