EXACT: How to Train Your Accuracy

Ivan Karpukhin; Stanislav Dereka; Sergey Kolesnikov

TL;DR

The paper introduces EXACT, a framework for directly maximizing the expected accuracy of a stochastic classifier. By modeling the score vector as $s \sim \mathcal{N}(\mu(x), \sigma^2(x) I)$ and optimizing $\mathcal{A}(\theta) = \mathbb{E}_{x,y} \mathrm{P}(s_y > \max_{i \neq y} s_i)$ via gradient methods, it overcomes the non-differentiability of accuracy. The method relies on efficient evaluation and differentiation of an orthant integral of a multivariate normal distribution, using the Genz algorithm together with careful handling of margins, variance scheduling, and gradient normalization. Empirical results on tabular datasets and deep image tasks show that EXACT can yield accuracy higher than or competitive with that of cross-entropy and hinge losses, with modest computational overhead that scales favorably with model complexity and number of classes. This approach offers a principled route to direct metric optimization and may apply to other non-differentiable targets.
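To make the objective $\mathcal{A}(\theta)$ concrete, the following is a minimal sketch that estimates the expected accuracy of a stochastic classifier by plain Monte Carlo sampling. It is not the paper's method (which evaluates the orthant integral analytically); the function name `expected_accuracy_mc` and its arguments are hypothetical, and the means and standard deviations are assumed to be given as NumPy arrays rather than produced by a trained model.

```python
import numpy as np

def expected_accuracy_mc(mu, sigma, labels, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_{x,y} P(s_y > max_{i != y} s_i).

    mu:     (N, C) predicted score means mu(x)
    sigma:  (N,)   predicted score standard deviations sigma(x)
    labels: (N,)   integer class labels y
    All names here are illustrative assumptions, not the paper's API.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    labels = np.asarray(labels)
    # Draw s ~ N(mu, sigma^2 I) for every example: shape (n_samples, N, C).
    noise = rng.standard_normal((n_samples,) + mu.shape)
    scores = mu[None] + sigma[None, :, None] * noise
    # A draw counts as correct when the true-class score is the largest one.
    correct = scores.argmax(axis=-1) == labels[None]
    return float(correct.mean())

# Hypothetical inputs: three examples, three classes.
mu = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.5], [1.0, 0.0, 0.0]]
labels = [0, 1, 2]
for s in (1e-6, 0.5, 2.0):
    print(s, expected_accuracy_mc(mu, [s, s, s], labels))
```

As $\sigma$ shrinks, the estimate approaches the ordinary (hard) accuracy of the arg-max classifier; larger $\sigma$ smooths the objective, which is what makes gradient-based optimization possible.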

Abstract

Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework by introducing stochasticity to a model's output and optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive experiments on linear models and deep image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.

Paper Structure (36 sections, 4 theorems, 40 equations, 10 figures, 9 tables)

Introduction
Related Work
Classification Losses
Surrogate Losses Beyond Accuracy
Stochastic Prediction
Motivation
EXACT
Definitions
Stochastic Model's Accuracy
Optimization
Inference
Improvements
Margin
Ratio Ambiguity
Variance Scheduler
...and 21 more sections

Key Result

Theorem 4.2

Suppose the score vector $s$ follows the multivariate normal distribution $\mathcal{N}(\mu, \sigma^2 I)$ in $\mathbb{R}^C$. Then the probability that the $y$-th score exceeds all other scores can be represented as $$\mathrm{P}(s_y > \max_{i \neq y} s_i) = \int_{\Omega_+} \mathcal{N}\left(t; \frac{D_y \mu}{\sigma}, D_y D_y^T\right) dt,$$ where $\mathcal{N}(t; \mu, \Sigma)$ denotes the multivariate normal PDF, $D_y$ is a delta matrix of order $C$ for the label $y$, and $\Omega_+$ is the orthant of integration ...
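The reduction behind this theorem can be checked numerically. The sketch below assumes that the delta matrix $D_y$ is the $(C-1) \times C$ matrix whose rows compute the differences $s_y - s_i$ for $i \neq y$ (this form is an assumption, since Definition 4.1 is not reproduced here). Under that assumption, $D_y s / \sigma$ is Gaussian with mean $D_y \mu / \sigma$ and covariance $D_y D_y^T$, and the probability of a correct prediction is the mass of the non-negative orthant. The helper names are hypothetical, and SciPy's generic multivariate normal CDF stands in for the paper's integration routine.

```python
import numpy as np
from scipy.stats import multivariate_normal

def delta_matrix(num_classes, y):
    """Assumed form of D_y: row k computes s_y - s_i for the k-th class i != y."""
    others = [i for i in range(num_classes) if i != y]
    d = np.zeros((num_classes - 1, num_classes))
    d[np.arange(num_classes - 1), others] = -1.0
    d[:, y] = 1.0
    return d

def prob_correct_orthant(mu, sigma, y):
    """P(s_y > max_{i != y} s_i) for s ~ N(mu, sigma^2 I), via an orthant integral."""
    mu = np.asarray(mu, dtype=float)
    d = delta_matrix(mu.shape[0], y)
    mean = d @ mu / sigma   # mean of the scaled differences D_y s / sigma
    cov = d @ d.T           # their covariance, which does not depend on sigma
    # By symmetry, P(t >= 0) for t ~ N(mean, cov) equals the CDF of N(0, cov) at `mean`.
    zero = np.zeros(mean.shape[0])
    return float(multivariate_normal(mean=zero, cov=cov).cdf(mean))

def prob_correct_mc(mu, sigma, y, n_samples=200_000, seed=0):
    """Monte Carlo reference: sample scores and count how often class y wins."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    s = mu + sigma * rng.standard_normal((n_samples, mu.shape[0]))
    return float((s.argmax(axis=1) == y).mean())

# Hypothetical score means for a 4-class problem.
mu = [1.0, 0.2, -0.5, 0.0]
print(prob_correct_orthant(mu, 0.8, 0), prob_correct_mc(mu, 0.8, 0))
```

The two estimates should agree up to sampling noise. Unlike the Monte Carlo count, the integral form is a smooth function of $\mu$ and $\sigma$, which is the property the optimization relies on.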

Figures (10)

Figure 1: A toy example that demonstrates the importance of accuracy optimization. The model consists of a single bias parameter (decision threshold), while the scaling weight is assumed to be 1. EXACT achieves 100% accuracy, while cross-entropy and hinge loss misclassify one element.
Figure 2: EXACT training pipeline. The model predicts the mean and variance of the logit vector. EXACT's training objective estimates accuracy, which is differentiable for the stochastic model.
Figure 3: Dependency of the expected accuracy on the model parameter in our toy example for different values of $\sigma$.
Figure 4: Dependence of the EXACT loss on the model parameter with and without margin. Margin affects training with large $\sigma$, creating a better optimization landscape in early epochs.
Figure 5: Gradient norm during training on CIFAR-100 for different loss functions.
...and 5 more figures

Theorems & Definitions (7)

Definition 4.1
Theorem 4.2
Theorem 4.3
Theorem 4.2
proof
Theorem 4.3
proof
