Investigating the Histogram Loss in Regression

Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, Martha White

TL;DR

This paper investigates a recent approach to regression, the Histogram Loss, which learns the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. It also demonstrates the viability of the Histogram Loss in common deep learning applications without the need for costly hyperparameter tuning.

Abstract

It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with a performance gain, and the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than from modeling extra information. We then demonstrate the viability of the Histogram Loss in common deep learning applications without the need for costly hyperparameter tuning.
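To make the loss described in the abstract concrete, here is a minimal NumPy sketch of the HL-Gaussian variant for a single example. It assumes the target distribution is a Gaussian centered at the label `y` and truncated to the histogram's range, and that the network's output is a vector of softmax logits, one per bin. The function names, the bin range, and the value of `sigma` are illustrative choices, not taken from the paper.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def target_coefficients(y, edges, sigma):
    """Probability mass that a Gaussian N(y, sigma^2), truncated to
    [edges[0], edges[-1]], assigns to each bin (the coefficients c_i)."""
    cdf = np.array([normal_cdf((e - y) / sigma) for e in edges])
    mass = np.diff(cdf)          # untruncated mass per bin
    return mass / mass.sum()     # renormalize: truncation to the bin range

def log_softmax(logits):
    """Numerically stable log of the softmax histogram h_i(x)."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def histogram_loss(logits, y, edges, sigma):
    """Cross-entropy between the target coefficients and the predicted histogram."""
    c = target_coefficients(y, edges, sigma)
    return float(-np.sum(c * log_softmax(logits)))

def predicted_mean(logits, edges):
    """Point prediction: expected bin center under the predicted histogram."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(np.exp(log_softmax(logits)) * centers))

# Hypothetical setup: 10 equal-width bins on [0, 1], untrained (all-zero) logits.
edges = np.linspace(0.0, 1.0, 11)
logits = np.zeros(10)
print(histogram_loss(logits, y=0.3, edges=edges, sigma=0.1))
print(predicted_mean(logits, edges))
```

In a real model the logits would come from the last layer of the network, and the loss would be averaged over a mini-batch.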

Paper Structure

This paper contains 42 sections, 6 theorems, 41 equations, 24 figures, and 9 tables.

Key Result

Proposition 0

Assume $\mathbf{x}, y$ are fixed, giving fixed coefficients $c_i$ in HL-Gaussian. Let $h_i(\mathbf{x})$ be as in eq_softmax, defined by the parameters $\mathbf{w} = \{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$ and $\boldsymbol{\theta}$, providing the predicted distribution $h_\mathbf{x}$. Assume for all [...] Then the norm of the gradient for HL-Gaussian, w.r.t. all the parameters in the network [...]
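The statement above is cut off, so the following does not reproduce the proposition's bound. It only illustrates a standard fact that is relevant when reasoning about gradient norms for a softmax cross-entropy loss: with fixed coefficients $c_i$ that sum to one, the gradient with respect to the softmax logits is $h - c$, whose $\ell_1$ norm is at most 2 regardless of how far the prediction is from the target. The sketch checks this against finite differences; all names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Predicted histogram h from logits (numerically stable)."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(c, logits):
    """-sum_i c_i log h_i, computed with a stable log-softmax."""
    z = logits - np.max(logits)
    return float(-np.sum(c * (z - np.log(np.sum(np.exp(z))))))

def grad_wrt_logits(c, logits):
    """Analytic gradient of the cross-entropy w.r.t. the logits: h - c.
    Valid when the coefficients c sum to one."""
    return softmax(logits) - c

def numerical_grad(c, logits, eps=1e-6):
    """Central finite-difference gradient, used only as a check."""
    g = np.zeros_like(logits)
    for i in range(logits.size):
        step = np.zeros_like(logits)
        step[i] = eps
        g[i] = (cross_entropy(c, logits + step)
                - cross_entropy(c, logits - step)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
c = rng.dirichlet(np.ones(8))          # arbitrary target coefficients
logits = 5.0 * rng.standard_normal(8)  # arbitrary logits
g = grad_wrt_logits(c, logits)
print(np.allclose(g, numerical_grad(c, logits), atol=1e-5))
print(np.abs(g).sum() <= 2.0)
```

By contrast, the gradient of the squared error with respect to a scalar prediction grows linearly with the error, so it has no comparable uniform bound.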

Figures (24)

  • Figure 1: Training with HL. The red curve is the target distribution and the blue histogram is the prediction distribution. The neural network is trained to minimize the cross-entropy between the two.
  • Figure 2: (Top) A sample histogram, and (bottom) a neural network with a softmax output layer that represents a histogram.
  • Figure 3: The behavior of the bound in Equation \ref{eq:predbound} on an example with Dirac delta functions. LHS shows the left-hand side, and the other two curves show the bounds obtained with Pinsker's and BH inequality.
  • Figure 4: (Left) A synthetic task with frequency 10 and offset 0. The blue dots show the training points, and the green and red curves show the predictions obtained after training with HL-Gaussian and $\ell_2$ respectively. (Middle) Learning curves for different frequencies in $\{1, 10, 20\}$ and fixed offset $0$. Green and red denote HL-Gaussian and $\ell_2$, and brighter shades denote higher frequencies. Each curve is averaged over 5 runs with different random initializations. (Right) Learning curves for different offsets in $\{0, 1, 10\}$ and fixed frequency $10$. Brighter shades show higher offsets. Altogether, HL-Gaussian trains remarkably faster than $\ell_2$ when the target has high frequency or is far from zero.
  • Figure 5: HL-Uniform results on CT Scan. Dotted and solid lines show train and test errors respectively. The parameter $\epsilon$ is the weighting on the uniform distribution and raising it only impaired performance.
  • ...and 19 more figures
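The Figure 3 caption refers to bounds obtained with Pinsker's and BH inequality. The paper's exact bound (Equation eq:predbound) is not reproduced on this page, so the sketch below only illustrates the general mechanism such bounds rely on, under two assumptions: that "BH" stands for the Bretagnolle-Huber inequality, and that the quantity of interest is the gap between the means of two histograms over the same bins. For distributions supported on an interval of width $W$, the gap between their means is at most $W \cdot \mathrm{TV}$, and the total variation distance is in turn bounded by $\sqrt{\mathrm{KL}/2}$ (Pinsker) and by $\sqrt{1 - e^{-\mathrm{KL}}}$ (Bretagnolle-Huber). Function names are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """Total variation distance: half the l1 distance."""
    return 0.5 * float(np.sum(np.abs(p - q)))

def mean_gap_and_bounds(p, q, centers):
    """Return |E_p - E_q| together with three upper bounds on it:
    width * TV, width * sqrt(KL / 2), and width * sqrt(1 - exp(-KL))."""
    width = float(centers.max() - centers.min())
    gap = abs(float(np.dot(p - q, centers)))
    kl = kl_divergence(p, q)
    tv_bound = width * total_variation(p, q)
    pinsker_bound = width * float(np.sqrt(kl / 2.0))
    bh_bound = width * float(np.sqrt(1.0 - np.exp(-kl)))
    return gap, tv_bound, pinsker_bound, bh_bound

# Hypothetical example: two random histograms over 10 bins on [0, 1].
rng = np.random.default_rng(0)
centers = np.linspace(0.05, 0.95, 10)
p = rng.dirichlet(np.ones(10))
q = rng.dirichlet(np.ones(10))
print(mean_gap_and_bounds(p, q, centers))
```

Pinsker's bound is the tighter of the two when the KL divergence is small, while the Bretagnolle-Huber bound never exceeds the interval width, so it stays informative when the KL divergence is large.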

Theorems & Definitions (6)

  • Proposition 0: Local Lipschitz constant for HL-Gaussian
  • Proposition 0: Bias Characterization
  • Proposition 0: Bound on Prediction Error
  • Proposition 0: Local Lipschitz constant for HL-Gaussian
  • Proposition 0: Bias Characterization
  • Proposition 0: Bound on Prediction Error