The Fisher-Rao Loss for Learning under Label Noise

Henrique K. Miyamoto, Fábio C. C. Meneghetti, Sueli I. R. Costa

TL;DR

This work introduces the Fisher–Rao loss for classification, derived from the Fisher information geometry of the discrete probability simplex and defined as $L_\mathrm{FR}(y,f(x))=(\arccos\sqrt{p_y})^2$, where $p_y$ is the probability the model assigns to the true class $y$. It provides robustness bounds under uniform label noise and analyses learning speed, revealing a trade-off: FR offers noise resilience similar to the Hellinger loss, converges faster than MAE, and is more robust than CE under noise. Theoretical relations show $L_\mathrm{MAE} \le L_\mathrm{H} \le L_\mathrm{FR} \le L_\mathrm{CE}$, with $L_\mathrm{FR}$ asymptotically aligning with $L_{q\text{-}\mathrm{CE}}$ for small losses, which situates FR as a bridge between robust and efficient losses. Empirical results on synthetic data and MNIST corroborate the theoretical findings, indicating that FR improves robustness to label noise without sacrificing performance on clean data, and motivate further exploration of regression settings and broader loss designs.
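As a rough illustration of the formula quoted in the TL;DR, the following NumPy sketch evaluates $(\arccos\sqrt{p_y})^2$ per sample and compares it with cross-entropy. The function names `fisher_rao_loss` and `cross_entropy_loss` and the `eps` parameter are hypothetical, chosen only for this example; the paper's own definition may differ from the TL;DR by a constant factor, so treat this as a sketch under that assumption rather than the authors' implementation.

```python
import numpy as np


def _true_class_prob(probs, labels):
    """Pick p_y, the predicted probability of the labelled class, per sample."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return probs[np.arange(labels.shape[0]), labels]


def fisher_rao_loss(probs, labels):
    """Per-sample (arccos(sqrt(p_y)))^2, as written in the TL;DR (assumed form)."""
    p_y = np.clip(_true_class_prob(probs, labels), 0.0, 1.0)
    return np.arccos(np.sqrt(p_y)) ** 2


def cross_entropy_loss(probs, labels, eps=1e-12):
    """Per-sample -log(p_y); eps (an assumption here) guards against log(0)."""
    p_y = np.clip(_true_class_prob(probs, labels), eps, 1.0)
    return -np.log(p_y)


if __name__ == "__main__":
    # Three samples over K = 4 classes, all labelled as class 0.
    probs = np.array([
        [1.00, 0.00, 0.00, 0.00],   # confident and correct
        [0.25, 0.25, 0.25, 0.25],   # uniform prediction
        [0.00, 1.00, 0.00, 0.00],   # confident and wrong
    ])
    labels = np.array([0, 0, 0])
    print("FR:", fisher_rao_loss(probs, labels))
    print("CE:", cross_entropy_loss(probs, labels))
```

Under this form the loss is bounded above by $(\pi/2)^2$, reached when $p_y = 0$, whereas cross-entropy grows without bound as $p_y \to 0$; this boundedness is the kind of property usually associated with robustness to mislabelled samples.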

Abstract

Choosing a suitable loss function is essential when learning by empirical risk minimisation. In many practical cases, the datasets used for training a classifier may contain incorrect labels, which motivates the use of loss functions that are inherently robust to label noise. In this paper, we study the Fisher-Rao loss function, which emerges from the Fisher-Rao distance in the statistical manifold of discrete distributions. We derive an upper bound for the performance degradation in the presence of label noise, and analyse the learning speed of this loss. By comparison with other commonly used losses, we argue that the Fisher-Rao loss provides a natural trade-off between robustness and training dynamics. Numerical experiments with synthetic and MNIST datasets illustrate this performance.

Paper Structure (11 sections, 5 theorems, 53 equations, 5 figures, 4 tables)

Key Result

Proposition 1

Let $L_\mathrm{MAE}$, $L_\mathrm{CE}$, $L_{\text{$q$-}\mathrm{CE}}$, $L_\mathrm{FR}$ and $L_\mathrm{H}$ denote the loss functions defined, respectively, in eq:mae-loss, eq:ce-loss, eq:q-loss, eq:fr-loss and eq:h-loss.

Figures (5)

  • Figure 1: Bounds $A(K,\eta)$ and $B(K,\eta)$, for $K=10$ and $\eta = \alpha \left(1-\frac{1}{K}\right)$, as function of $\alpha \in \left[0,1\right)$.
  • Figure 2: Bounds $A(K,\eta)$ and $B(K,\eta)$ for $\eta = 0.8 \left(1-\frac{1}{K}\right)$, as function of ${2 \le K \le 300}$.
  • Figure 3: Functions $h(p_y)$ and their derivatives $\vert h'(p_y) \vert$ for different loss functions, cf. Table \ref{tab:functions-h}.
  • Figure 4: Training (dashed lines) and test (solid lines) accuracy for synthetic dataset.
  • Figure 5: Training (dashed lines) and test (solid lines) accuracy for MNIST dataset.

Theorems & Definitions (10)

  • Proposition 1
  • proof
  • Lemma 2
  • proof
  • Proposition 3
  • proof
  • Lemma 4
  • proof
  • Proposition 5
  • proof