Remarks on Loss Function of Threshold Method for Ordinal Regression Problem

Ryoya Yamasaki, Toshiyuki Tanaka

TL;DR

This work investigates why threshold-based ordinal regression methods succeed or fail under varying data distributions and learning procedures. It analyzes all-threshold (AT), immediate-threshold (IT), and piecewise-linear (PL) losses, deriving conditions under which surrogate-risk minimization yields Bayes-optimal classifiers (notably under CL/ACL models) and identifying failure modes when data are highly heteroscedastic or non-unimodal. Through simulations, synthetic data, and real-world age-estimation tasks, the authors show that non-PL losses often achieve stronger approximation performance on unimodal data, while PL and IT-based approaches can cause the learned values of the one-dimensional transformation (1DT) to concentrate at a few points, degrading performance in larger-scale or multimodal settings. The findings highlight how the choice of loss, bias structure, and optimization strategy shapes the approximation error and thus practical performance, offering guidance for designing more robust threshold-based ordinal regression methods.

Abstract

Threshold methods are popular for ordinal regression problems, which are classification problems for data with a natural ordinal relation. They learn a one-dimensional transformation (1DT) of observations of the explanatory variable, and then assign label predictions to the observations by thresholding their 1DT values. In this paper, we study the influence of the underlying data distribution and of the learning procedure of the 1DT on the classification performance of the threshold method via theoretical considerations and numerical experiments. For example, we found that threshold methods based on typical learning procedures may perform poorly when the probability distribution of the target variable conditioned on an observation of the explanatory variable tends to be non-unimodal. Another finding is that learned 1DT values are concentrated at a few points under the learning procedure based on a piecewise-linear loss function, which can make it difficult to classify data well.
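
To make the thresholding step concrete, here is a minimal Python sketch. It assumes $K$ labels $1,\ldots,K$ and $K-1$ ascending thresholds, and it uses a common formulation of the all-threshold (AT) and immediate-threshold (IT) losses with a logistic surrogate. The function names, the strict-inequality convention, and the exact form of the losses are illustrative assumptions and may differ in detail from the definitions used in the paper.

```python
import numpy as np

def threshold_predict(a_values, thresholds):
    """Label = 1 + number of thresholds that the 1DT value exceeds.

    Assumes `thresholds` has length K-1 and is sorted in ascending order.
    The strict inequality is an illustrative convention.
    """
    a_values = np.asarray(a_values, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    return 1 + np.sum(a_values[:, None] > thresholds[None, :], axis=1)

def logistic(z):
    """Logistic surrogate log(1 + exp(-z)), computed stably."""
    return np.logaddexp(0.0, -z)

def all_threshold_loss(a, y, thresholds, phi=logistic):
    """AT loss (assumed form): compare the 1DT value a with every threshold.

    Thresholds b_k with k < y should lie below a; those with k >= y above a.
    """
    b = np.asarray(thresholds, dtype=float)
    k = np.arange(1, len(b) + 1)
    signs = np.where(k < y, 1.0, -1.0)
    return float(np.sum(phi(signs * (a - b))))

def immediate_threshold_loss(a, y, thresholds, phi=logistic):
    """IT loss (assumed form): compare a only with the thresholds adjacent to y."""
    b = np.asarray(thresholds, dtype=float)
    num_labels = len(b) + 1
    loss = 0.0
    if y > 1:
        loss += phi(a - b[y - 2])   # lower neighbour b_{y-1}
    if y < num_labels:
        loss += phi(b[y - 1] - a)   # upper neighbour b_y
    return float(loss)

# Hypothetical thresholds for K = 4 labels.
b = [0.0, 1.0, 2.0]
labels = threshold_predict([-1.0, 0.5, 2.5], b)
```

With the hypothetical thresholds above, the three 1DT values fall below the first threshold, between the first and second, and above the third. In this formulation the AT loss contains the IT terms plus further non-negative terms, so it is never smaller than the IT loss for the same inputs.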

Paper Structure (54 sections, 10 theorems, 75 equations, 7 figures, 13 tables)

This paper contains 54 sections, 10 theorems, 75 equations, 7 figures, 13 tables.

Key Result

Theorem 1

Let ${\mathcal{A}}\subseteq\{a:{\mathbb{R}}^d\to{\mathbb{R}}\}$ and ${\mathcal{B}}_0^{\rm ord}\subseteq{\mathcal{B}}\subseteq{\mathcal{B}}_0$ and introduce the conditions Then, it holds that

Figures (7)

  • Figure 1: Instances of the 10-dimensional unimodal PMF ${\bm{p}}=(p_k)_{k\in[10]}$ with the mode 5.
  • Figure 2: Instances of the CL model with $K=10$. Figures show $P_{\rm cl}(y;u,{\bm{b}})$ for $y=1,\ldots,10$, $u\in[-\Delta,9\Delta]$, ${\bm{b}}={\bm{b}}^{[\Delta]}$ (plates (\ref{d1})--(\ref{d3})), $\acute{{\bm{b}}}^{[\Delta]}$ (plates (\ref{e1})--(\ref{e3})), and $\Delta=1/3,1,3$. At $u$ in the region where the background color is white or gray, the PMF $(P_{\rm cl}(y;u,{\bm{b}}))_{y\in[10]}$ is unimodal or not, respectively.
  • Figure 3: Instances of the ACL model with $K=10$. Figures show $P_{\rm acl}(y;u,{\bm{b}})$ for $y=1,\ldots,10$, $u\in[-\Delta,9\Delta]$, ${\bm{b}}={\bm{b}}^{[\Delta]}$ (plates (\ref{h1})--(\ref{h3})), $\acute{{\bm{b}}}^{[\Delta]}$ (plates (\ref{i1})--(\ref{i3})), $\grave{{\bm{b}}}^{[\Delta]}$ (plates (\ref{j1})--(\ref{j3})), and $\Delta=1/3,1,3$. At $u$ in the region where the background color is white or gray, the PMF $(P_{\rm acl}(y;u,{\bm{b}}))_{y\in[10]}$ is unimodal or not, respectively.
  • Figure 4: Phase transition of the surrogate risk minimizer for Example \ref{ex:3hingit} with $(p_1,p_2)=(p_4,p_3)=(0,0.5),(0.1,0.4),\ldots,(0.5,0)$. The surrogate risk minimizer for $(q_1,q_2,q_3)$ in the red (resp. blue) region is in phase-1 (resp. phase-2). For example, the CPD has a small scale for $(q_1,q_2,q_3)\approx(0,0,1)$ or a large scale for $(q_1,q_2,q_3)\approx(1/3,1/3,1/3)$.
  • Figure 5: Learned 1DT value $\bar{a}_i$ and optimal label prediction $\tilde{f}_{\ell,i}$ (color of point) versus $i$, and the method's label prediction $\bar{f}_{\ell,i}$ (lightened background color), for the simulation in Section \ref{sec:SS} with Task-Z. For example, the 3 plates of (\ref{m1}) show the 3 results for H-1 with logi-AT-O, -IT-N, and -IT-O from top to bottom.
  • ...and 2 more figures

Theorems & Definitions (21)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Definition 1
  • Theorem 8
  • Example 1
  • ...and 11 more