Misclassification excess risk bounds for PAC-Bayesian classification via convexified loss

The Tien Mai

TL;DR

The paper addresses the problem of bounding misclassification excess risk for PAC-Bayesian classification when using a convex surrogate loss. It develops a PAC-Bayes relative bound in expectation under a low-noise/margin condition, showing that with the Gibbs posterior and a calibrated temperature, the misclassification risk can be controlled by a combination of the convex-loss excess and a KL-divergence penalty, yielding fast n^{-1} rates. The framework is illustrated through two nontrivial applications: high-dimensional sparse classification with a sparsity-promoting prior achieving s^* log(d/s^*)/n bounds, and 1-bit matrix completion with a low-rank prior achieving r(d_1+d_2)/n bounds (up to log factors), both minimax-optimal. These results extend PAC-Bayesian theory beyond risk bounds for convexified losses to direct misclassification risk, providing practical guarantees for convex PAC-Bayesian methods in classification tasks with large or structured parameter spaces.

Abstract

PAC-Bayesian bounds have proven to be a valuable tool for deriving generalization bounds and for designing new learning algorithms in machine learning. However, they typically focus on providing generalization bounds with respect to a chosen loss function. In classification tasks, due to the non-convex nature of the 0-1 loss, a convex surrogate loss is often used, and thus current PAC-Bayesian bounds are primarily specified for this convex surrogate. This work shifts its focus to providing misclassification excess risk bounds for PAC-Bayesian classification when using a convex surrogate loss. Our key ingredient here is to leverage PAC-Bayesian relative bounds in expectation rather than relying on PAC-Bayesian bounds in probability. We demonstrate our approach in several important applications.
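
As a rough illustration of the Gibbs posterior with a temperature parameter mentioned in the TL;DR, the sketch below reweights a prior over a finite set of classifiers by exp(-lambda * empirical hinge risk). It is a minimal toy under stated assumptions, not the paper's method: the threshold classifiers, the toy data, the uniform prior, and the helper names `hinge_loss`, `empirical_hinge_risk`, and `gibbs_posterior` are all hypothetical, and `lam` is set to n (that is, the constant in lambda = n / C-bar is assumed to equal 1).

```python
import math

def hinge_loss(margin):
    # Convex surrogate for the 0-1 loss: max(0, 1 - y * f(x)).
    return max(0.0, 1.0 - margin)

def empirical_hinge_risk(predict, data):
    # Average hinge loss of one classifier over labelled pairs (x, y), y in {-1, +1}.
    return sum(hinge_loss(y * predict(x)) for x, y in data) / len(data)

def gibbs_posterior(risks, prior, lam):
    # Posterior weight of each classifier is proportional to
    # prior * exp(-lam * empirical risk); computed in log space for stability.
    log_w = [math.log(p) - lam * r for p, r in zip(prior, risks)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical toy problem: 1-D threshold classifiers f_t(x) = x - t.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
thresholds = [-1.5, 0.0, 1.5]
classifiers = [lambda x, t=t: x - t for t in thresholds]

risks = [empirical_hinge_risk(f, data) for f in classifiers]
prior = [1.0 / len(classifiers)] * len(classifiers)

# Assumption: temperature lam = n, i.e. the constant C-bar is taken to be 1.
posterior = gibbs_posterior(risks, prior, lam=len(data))
```

In this toy setting the classifier with threshold 0 has zero empirical hinge risk, so it receives the largest posterior weight; a larger `lam` concentrates the posterior more sharply on it, while `lam = 0` returns the prior.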

Paper Structure (13 sections, 7 theorems, 54 equations)


Key Result

Theorem 1

Assume that the bounded-loss assumption, the Lipschitz assumption, and the Bernstein condition are satisfied, and take $\lambda = n/\overline{C}$. Then we have:

Theorems & Definitions (22)

  • Remark 1
  • Theorem 1
  • Remark 2
  • Theorem 2
  • Remark 3
  • Corollary 1
  • Remark 4
  • Example 1: Finite case
  • Example 2
  • Theorem 3
  • ...and 12 more