$f$-Divergence Based Classification: Beyond the Use of Cross-Entropy

Nicola Novello; Andrea M. Tonello

TL;DR

The paper moves beyond cross-entropy by reframing classification as a maximum a posteriori (MAP) problem and learning the posterior $p_{X|Y}$ through $f$-divergence-based objectives. It introduces two ways to design objectives: a top-down approach built on the variational representation of the $f$-divergence, and a bottom-up approach that yields flexible objective designs, including a novel shifted log (SL) divergence. The authors derive theoretical guarantees for posterior estimation, propose two discriminator architectures (unsupervised and supervised), and validate the method on toy problems, image datasets, and communication-channel decoding, showing that SL often achieves the highest accuracy and faster convergence. Together, these contributions give a principled, flexible framework for posterior estimation in classification.

Abstract

In deep learning, classification tasks are formalized as optimization problems often solved via the minimization of the cross-entropy. However, recent advancements in the design of objective functions allow the usage of the $f$-divergence to generalize the formulation of the optimization problem for classification. We adopt a Bayesian perspective and formulate the classification task as a maximum a posteriori probability problem. We propose a class of objective functions based on the variational representation of the $f$-divergence. Furthermore, driven by the challenge of improving the state-of-the-art approach, we propose a bottom-up method that leads us to the formulation of an objective function corresponding to a novel $f$-divergence referred to as shifted log (SL). We theoretically analyze the objective functions proposed and numerically test them in three application scenarios: toy examples, image datasets, and signal detection/decoding problems. The analyzed scenarios demonstrate the effectiveness of the proposed approach and that the SL divergence achieves the highest classification accuracy in almost all the considered cases.
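The variational representation mentioned in the abstract can be illustrated numerically. The sketch below is not the authors' code: it uses the standard bound $D_f(P\|Q) \geq \mathbb{E}_P[T] - \mathbb{E}_Q[f^*(T)]$ for the KL case only, where $f(u) = u \log u$, $f^*(t) = e^{t-1}$, and the maximizer is $T = 1 + \log(p/q)$. The two pmfs `p` and `q` and the helper names are made up for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i) for strictly positive pmfs."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def variational_objective(T, p, q):
    """E_P[T] - E_Q[f*(T)] with f*(t) = exp(t - 1), the conjugate of f(u) = u log u."""
    T, p, q = (np.asarray(a, dtype=float) for a in (T, p, q))
    return float(np.sum(p * T) - np.sum(q * np.exp(T - 1.0)))

# Hypothetical pmfs on three outcomes (illustrative values only).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# For KL the bound is tight at T = f'(p/q) = 1 + log(p/q).
T_opt = 1.0 + np.log(p / q)
bound_at_opt = variational_objective(T_opt, p, q)

# Any other T gives a value that is no larger than the divergence.
bound_at_zero = variational_objective(np.zeros(3), p, q)

print("KL(P||Q)        :", kl_divergence(p, q))
print("objective at T* :", bound_at_opt)
print("objective at T=0:", bound_at_zero)
```

In a learning setting, `T` would be the output of a neural network and the expectations would be replaced by sample averages; the closed-form maximizer is used here only to show that the bound is attained.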

Paper Structure

This paper contains 45 sections, 18 theorems, 90 equations, 7 figures, 6 tables.

Introduction
MAP-based Classification Through Posterior Probability Learning
Posterior Probability Learning Through the Exploitation of $f$-Divergence
$f$-Divergence
Posterior Estimation Through the Variational Representation of the $f$-Divergence
Bottom-Up Posterior Probability Learning
Shifted Log Objective Function and Divergence
Remarks on the New Objective Function and $f$-Divergence
Comparison Between SL and GAN Divergences
Discriminator Architecture
Unsupervised Architecture
Supervised Architecture
Results
Implementation Details
Image Datasets Classification
...and 30 more sections

Key Result

Theorem 3.1

Let $X$ and $Y$ be the random vectors with pdfs $p_X(\mathbf{x})$ and $p_Y(\mathbf{y})$, respectively. Assume $\mathbf{y} = H(\mathbf{x})$, where $H(\cdot)$ is a stochastic function, then $p_{XY}(\mathbf{x}, \mathbf{y})$ is the joint density. Define $\mathcal{T}_x$ to be the support of $X$ and $p_U$ [...]. Then, [...] leads to the estimation of the posterior density [...], where $T^{\diamond}(\mathbf{x},\mathbf{y})$ [...]
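The theorem statement above is truncated in this copy, so the following is only an illustration of the general idea, not a reproduction of the result. It assumes that $p_U$ is a uniform density over the support $\mathcal{T}_x$ and that a KL-type discriminator is used, whose optimum is $1 + \log$ of the density ratio $p_{XY} / (p_U \, p_Y)$. Under those assumptions the ratio equals $|\mathcal{T}_x| \, p_{X|Y}$, so dividing by $|\mathcal{T}_x|$ recovers the posterior, and the MAP decision is its argmax. The toy prior and likelihood values are hypothetical.

```python
import numpy as np

# Hypothetical toy model: 3 classes (X) and 2 observation values (Y).
p_x = np.array([0.5, 0.3, 0.2])            # prior over classes
p_y_given_x = np.array([[0.9, 0.1],        # likelihood rows: one per class
                        [0.5, 0.5],
                        [0.1, 0.9]])

joint = p_x[:, None] * p_y_given_x          # p_XY(x, y)
p_y = joint.sum(axis=0)                     # marginal p_Y(y)

num_classes = len(p_x)                      # |T_x| for a discrete support
p_u = np.full(num_classes, 1.0 / num_classes)  # assumed uniform reference on T_x

# Density ratio between the joint and the product p_U(x) p_Y(y).
ratio = joint / (p_u[:, None] * p_y[None, :])

# Optimal KL-type discriminator output (assumption: T = 1 + log ratio).
t_opt = 1.0 + np.log(ratio)

# Invert the discriminator and rescale by |T_x| to get the posterior estimate.
posterior_hat = np.exp(t_opt - 1.0) / num_classes

# Reference posterior from Bayes' rule, for comparison.
posterior_true = joint / p_y[None, :]

# MAP decision: most probable class for each observation value.
map_decision = posterior_hat.argmax(axis=0)

print(np.round(posterior_hat, 4))
print(map_decision)
```

Here the discriminator optimum is computed in closed form; in practice it would be approximated by a trained network, and the quality of the posterior estimate would depend on how well that network is optimized.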

Figures (7)

Figure 1: System model representation. $X$ is the input of a stochastic model $H(\cdot)$, while the output is $Y$. In the example represented by $H_1(\cdot)$, the input is the class element "dog" and the output is an image of a dog. In contrast, $H_2(\cdot)$ is a communication channel: the input is a codeword, and the output is the binary representation of that codeword after noise is added.
Figure 2: Diagrams of unsupervised and supervised architectures. The thick rectangle delineates the discriminator in Fig. 1. The trapezoidal shape represents the neural network architecture.
Figure 3: SER achieved by using a 4-PAM modulation over a nonlinear communication channel.
Figure 4: Continuous posterior density estimation for various $f$-divergences. The results of the Exponential task are shown in the upper row, and the results of the Gaussian task in the lower row. The true posterior density is the first plot of each row.
Figure 5: Convergence speed of the test accuracy over $200$ training epochs.
...and 2 more figures

Theorems & Definitions (29)

Theorem 3.1
Lemma 3.2
Theorem 4.1
Theorem 5.1
Corollary 5.2
Corollary 5.4
Theorem 6.1
Lemma 2.1
proof
Lemma 2.2
...and 19 more
