Interval Regression: A Comparative Study with Proposed Models

Tung L Nguyen, Toby Dylan Hocking

TL;DR

Interval-valued targets complicate regression tasks; this paper provides a comprehensive review of existing interval regression models and introduces three proposed approaches (KNN, MLP, MMIF). It systematically evaluates seven models across real-world and synthetic datasets using hinge-based loss, highlighting that MMIF generally achieves the best performance and consistency, with MMIT as a strong, lighter alternative. The AFT model in XGBoost exhibits limitations, particularly with left-censoring, and may require careful preprocessing. The study offers practical guidance for model selection in interval regression and contributes open-source code to support reproducible research.

Abstract

Regression models are essential for a wide range of real-world applications. However, in practice, target values are not always precisely known; instead, they may be represented as intervals of acceptable values. This challenge has led to the development of Interval Regression models. In this study, we provide a comprehensive review of existing Interval Regression models and introduce alternative models for comparative analysis. Experiments are conducted on both real-world and synthetic datasets to offer a broad perspective on model performance. The results demonstrate that no single model is universally optimal, highlighting the importance of selecting the most suitable model for each specific scenario.

Paper Structure

This paper contains 29 sections, 13 equations, 45 figures, 6 tables.

Figures (45)

  • Figure 1: Example of converting Interval Regression into Standard Regression. In approach 1, each interval instance is represented by its two endpoints, while in approach 2, it is represented by its midpoint. The goal of Interval Regression is to predict a value that falls within the target interval. This example shows that both conversion approaches perform poorly in the Interval Regression setting, so they are not recommended.
  • Figure 2: Visualization of the Loss Functions: Error values relative to predictions and targets.
  • Figure 3: Visualization of the AFT XGBoost Loss (negative log-likelihood) for 3 distributions.
  • Figure 4: The mean and standard deviation of the log of test squared hinge errors from simulated datasets. The Linear model performs best when the dataset is linear. In nonlinear datasets, Tree-based models achieve the best performance.
  • Figure 5: The mean and standard deviation of test squared hinge errors for datasets with a large number of features. Tree-based models generally perform well due to their inherent feature selection mechanism. In contrast, although an MLP with the ReLU activation function generalizes the Linear model, it fails to outperform the Linear model. One reason is that when most features in a dataset are noisy, an MLP cannot reduce their impact on predictions as effectively as a Linear model with L1 regularization can.
  • ...and 40 more figures
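
The captions for Figures 4 and 5 report test squared hinge errors. As a rough illustration of that kind of metric, the sketch below assumes a common form: the error is zero when the prediction falls inside the target interval and grows with the squared distance to the violated endpoint otherwise. The function names, the `margin` parameter, and the sample intervals are hypothetical and are not taken from the paper, whose exact definition may differ.

```python
import math


def squared_hinge_error(pred, low, high, margin=0.0):
    """Squared hinge error of one prediction against a target interval.

    Assumed form (not confirmed by the source): zero when pred lies in
    [low + margin, high - margin], otherwise the squared amount by which
    pred violates an endpoint. Use -math.inf or math.inf for an endpoint
    that is unbounded (a censored target).
    """
    below = max(0.0, low - pred + margin)   # violation of the lower endpoint
    above = max(0.0, pred - high + margin)  # violation of the upper endpoint
    return below ** 2 + above ** 2


def mean_squared_hinge_error(preds, intervals, margin=0.0):
    """Average the per-instance errors over a test set."""
    errors = [
        squared_hinge_error(p, low, high, margin)
        for p, (low, high) in zip(preds, intervals)
    ]
    return sum(errors) / len(errors)


# Hypothetical targets: one bounded interval and two with an open side.
intervals = [(1.0, 3.0), (-math.inf, 2.0), (4.0, math.inf)]
preds = [2.0, 3.0, 2.5]

per_instance = [
    squared_hinge_error(p, low, high) for p, (low, high) in zip(preds, intervals)
]
print(per_instance)
print(mean_squared_hinge_error(preds, intervals))
```

With the default zero margin, a prediction anywhere inside the interval incurs no error. This is why, as Figure 1 notes, fitting a standard regressor to endpoints or midpoints optimizes a different objective from the one Interval Regression is evaluated on.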