Building Bridges between Regression, Clustering, and Classification

Lawrence Stewart, Francis Bach, Quentin Berthet

TL;DR

The paper addresses the observation that regression with squared loss can yield suboptimal results when training neural networks, and shows that reframing regression as classification via a learnable target encoder-decoder improves training dynamics and predictions. By mapping targets to distributions over $k$ classes on the simplex and decoding with a linear head, the approach blends discrete and continuous representations; variants include hard and soft binning, pre-trained encoders, and an end-to-end joint objective that balances auto-encoding, KL alignment, and regression loss. Across eight real-world datasets, soft binning generally outperforms hard binning, and end-to-end joint training performs best, with reported gains of up to 25% over least squares on average. The method offers improved predictive accuracy, interpretable decoders, and a flexible framework for interpolating between regression and classification objectives.

Abstract

Regression, the task of predicting a continuous scalar target $y$ based on some features $x$, is one of the most fundamental tasks in machine learning and statistics. It has been observed and theoretically analyzed that the classical approach, mean-squared error minimization, can lead to suboptimal results when training neural networks. In this work, we propose a new method to improve the training of these models on regression tasks with continuous scalar targets. Our method is based on casting this task in a different fashion, using a target encoder and a prediction decoder, inspired by approaches in classification and clustering. We showcase the performance of our method on a wide range of real-world datasets.
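
The encoder-decoder idea described above can be illustrated with a minimal sketch for a scalar target. This is not the authors' implementation: the fixed grid `centers`, the softmax-of-negative-squared-distance form of the soft encoder, the `temperature` parameter, and all function names are assumptions chosen for illustration. It shows a hard (one-hot) encoding, a soft encoding on the simplex, and a linear decoder that maps a distribution back to a real value.

```python
import math

def hard_bin(y, centers):
    """Hard binning: put all probability mass on the nearest center (one-hot)."""
    nearest = min(range(len(centers)), key=lambda i: abs(y - centers[i]))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]

def soft_bin(y, centers, temperature=1.0):
    """Soft binning (assumed form): softmax over negative squared distances."""
    logits = [-((y - c) ** 2) / temperature for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(p, values):
    """Linear decoder: weighted sum of one value per class."""
    return sum(pi * vi for pi, vi in zip(p, values))

# Hypothetical grid of k = 3 class centers and a single target.
centers = [-1.0, 0.0, 1.0]
y = 0.3

p_hard = hard_bin(y, centers)                    # [0.0, 1.0, 0.0]
p_soft = soft_bin(y, centers, temperature=0.5)   # a full distribution over 3 classes

y_hard = decode(p_hard, centers)  # collapses to the nearest center, 0.0
y_soft = decode(p_soft, centers)  # lies between centers, closer to y
```

In this toy setting the hard encoding loses the position of $y$ inside its bin, whereas the soft encoding retains part of it, which is one intuition for why a soft target distribution can be decoded more accurately.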

Paper Structure

This paper contains 22 sections, 21 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Framework description. Our framework is based on a target encoder $\psi_w$ (in red) that yields for each $y$ an encoded distribution $\psi_w(y)$ over $k$ classes. A classification model $\pi_\theta = \mathrm{softmax}(g_\theta)$ is trained with a KL objective on this distribution. A decoder model $\mu$ (in blue) decodes this distribution into the target space $\mathbb{R}^m$. The target encoder and decoder can be trained using an auto-encoding loss, as well as a joint end-to-end objective (see Section \ref{sec:methods}).
  • Figure 2: Embedding and binning the target space $\mathbb{R}^m$ (here $m=2$) into $\Delta_k$ (here $k=9$), for both a fixed grid of encoders (Top) and a learnt encoder (Bottom). In both cases we display the encoders, including a highlighted one, for a fixed $i \in [k]$ and a target $y \in \mathbb{R}^m$ (blue cross). We first illustrate hard binning (Left), where $y$ (and any $y$ in the same highlighted region) is assigned to one class (via a one-hot encoding), and then soft binning, both with the contour plot of $\psi_w(\cdot)_i$ for one $i \in [k]$ (Center) and with $\psi_w(y)$ as a distribution in $\Delta_k$ (Right).
  • Figure 3: Experimental results across datasets. We report the test root mean squared error (rMSE) over 8 different datasets (see Datasets above) for the 6 methods listed in Section \ref{sec:expe-methods}, all for $k=25$. The methods are displayed in each of the 8 groups from left to right. All errors are normalized to the error of the first baseline: least squares is set to 1.0 in each dataset, and the others are scaled proportionally.
  • Figure 4: Experimental results averaged over all datasets. We observe an overall hierarchy between the different methods considered.
  • Figure 5: Impact of different architecture and training hyperparameters on the performance of the methods. Top: for the soft-binning approach, the impact of $k$ for values between $3$ and $45$. Bottom: for the end-to-end approach, the impact of the value of $\lambda_{\mathsf{KL}}$ on the final value.
  • ...and 2 more figures
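
The training signal described in the Figure 1 caption (a KL objective between the encoded target $\psi_w(y)$ and the classifier output $\pi_\theta = \mathrm{softmax}(g_\theta)$, combined with a decoder) can be sketched as follows. The exact form of the joint objective is not given in this summary, so the three-term sum below, the placement of the weight `lam_kl` (standing in for $\lambda_{\mathsf{KL}}$), and all function names are assumptions for illustration only, not the paper's definition.

```python
import math

def softmax(logits):
    """Map raw scores g_theta(x) to a distribution pi_theta over k classes."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same k classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def decode(p, values):
    """Linear decoder: weighted sum of one value per class."""
    return sum(pi * vi for pi, vi in zip(p, values))

def joint_loss(y, encoded, logits, values, lam_kl=1.0):
    """Assumed joint objective: auto-encoding + lam_kl * KL + regression."""
    pi = softmax(logits)
    auto_encoding = (decode(encoded, values) - y) ** 2  # decode(psi(y)) vs. y
    kl = kl_divergence(encoded, pi)                     # psi(y) vs. pi_theta(x)
    regression = (decode(pi, values) - y) ** 2          # decode(pi_theta(x)) vs. y
    return auto_encoding + lam_kl * kl + regression

# Hypothetical example with k = 3 classes.
values = [-1.0, 0.0, 1.0]
encoded = [0.1, 0.7, 0.2]   # psi_w(y) for a target y = 0.1
y = 0.1

matching_logits = [math.log(p) for p in encoded]  # classifier reproduces psi_w(y)
uniform_logits = [0.0, 0.0, 0.0]                  # uninformative classifier

loss_match = joint_loss(y, encoded, matching_logits, values)
loss_uniform = joint_loss(y, encoded, uniform_logits, values)
```

When the classifier reproduces the encoded distribution and that distribution decodes back to $y$, every term of this sketch is close to zero; an uninformative classifier incurs both a KL penalty and a regression penalty.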