Derivative-based regularization for regression

Enrico Lopedoto; Maksim Shekhunov; Vitaly Aksenov; Kizito Salako; Tillman Weyde

TL;DR

This paper introduces DLoss, a derivative-based regularizer for regression that aligns a model's directional derivatives with data-derived derivatives estimated from training tuples. By combining a derivative-matching term with the standard mean squared error, and using either nearest-neighbour or random tuple selection to compute data derivatives, the method aims to capture the target function's differential structure. Empirical results on real and synthetic datasets show that DLoss, especially with nearest-neighbour tuples, improves validation MSE on average and often ranks first among considered regularizers, albeit with higher computational cost. The approach provides a data-driven regularization mechanism that can enhance generalization without altering model architecture, and it opens avenues for extension to more models and to classification tasks.

Abstract

In this work, we introduce a novel approach to regularization in multivariable regression problems. Our regularizer, called DLoss, penalises differences between the model's derivatives and the derivatives of the data-generating function as estimated from the training data. We call these estimated derivatives data derivatives. The goal of our method is to align the model to the data, not only in terms of target values but also in terms of the derivatives involved. To estimate data derivatives, we select 2-tuples of input-value pairs from the training data, using either nearest-neighbour or random selection. On synthetic and real datasets, we evaluate the effectiveness of adding DLoss, with different weights, to the standard mean squared error loss. The experimental results show that DLoss with nearest-neighbour selection obtains, on average, the best rank with respect to MSE on validation data sets, compared to no regularization, L2 regularization, and Dropout.
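As a rough illustration of the tuple selection and data-derivative estimation described above, the following NumPy sketch pairs each training point with a partner and computes a finite-difference slope along the connecting direction. The function names `select_tuples` and `data_derivative`, the one-partner-per-point pairing, and the normalisation by the Euclidean distance are assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np

def select_tuples(X, mode="nn", rng=None):
    """Pair every training point i with one partner j != i (hypothetical helper).

    mode="nn":     j is the nearest other point in Euclidean distance.
    mode="random": j is a uniformly random other point.
    """
    n = X.shape[0]
    i = np.arange(n)
    if mode == "nn":
        # Full pairwise distance matrix; fine for small n, O(n^2) memory.
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)  # a point is never its own partner
        j = dist.argmin(axis=1)
    elif mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        # A random offset in [1, n-1] guarantees j != i.
        j = (i + rng.integers(1, n, size=n)) % n
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return i, j

def data_derivative(X, y, i, j, tiny=1e-12):
    """Estimate the slope of the targets along v = x_j - x_i (assumed form).

    Returns the midpoints x_m, the difference vectors v, and the estimate
    (y_j - y_i) / ||v||, attributed to the midpoint of each tuple.
    """
    v = X[j] - X[i]
    x_m = 0.5 * (X[i] + X[j])
    norm = np.maximum(np.linalg.norm(v, axis=1), tiny)  # guard duplicate inputs
    return x_m, v, (y[j] - y[i]) / norm
```

Nearest-neighbour pairs give short difference vectors, so the finite-difference slope is a more local estimate than the one obtained from random pairs, at the cost of the pairwise distance computation.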

Paper Structure (20 sections, 10 equations, 5 figures, 5 tables)

Introduction
Related Work
Regularization
Neural Networks, Derivatives and Differential Equations
Derivative-based Regularization Method
Intuition
Regression
Regularization
Derivative Error
Data Derivative
Model Derivative
Tuple Selection
Optimization
Experiments
Model parameters
...and 5 more sections

Figures (5)

Figure 1: DLoss approach in the one-dimensional case for a tuple of pairs $((\mathbf{x}_i, y_i),(\mathbf{x}_j, y_j))$. On the horizontal axis we show the inputs $x$, which are scalars in this example. On the vertical axis, we show the scalar regression targets. In general, for a pair of points $\mathbf{x}_i$ and $\mathbf{x}_j$, we calculate the midpoint $\mathbf{x}_m = (\mathbf{x}_i+\mathbf{x}_j)/2$ and the difference vector $\mathbf{v} = \mathbf{x}_j - \mathbf{x}_i$. The red line shows the model derivative $\nabla_{\mathbf{v}}f(\mathbf{x}_m)$ calculated at $\mathbf{x}_m$. The blue line shows the estimated derivative $\nabla^*_{\mathbf{v}}g(\mathbf{x}_m)$ calculated at $\mathbf{x}_m$. With DLoss, we aim to make the blue and the red lines parallel.
Figure 2: Illustration of data derivatives calculated from two tuples of pairs. For each tuple $((\mathbf{x}^s_i, y^s_i),(\mathbf{x}^s_j, y^s_j))$, where $s\in\{1,2\}$, we calculate the midpoint $\mathbf{x}^s_m$, difference vector $\mathbf{v}^s$, and the corresponding data derivative $\nabla^{*}_{\mathbf{v}^s}g(\mathbf{x}^s_m)$ over a 2-dimensional feature space ($\mathbf{x} \in {\mathbb R}^2$).
Figure 3: Learning curves averaged over 5-fold cross-validation for the Real Dataset group. Training set on the left, validation set on the right. Epochs on the x-axis and MSE on the y-axis. The curves show: $STD$, $STD+L_2$, $STD+DO$, $DL_{RND}$ and $DL_{NN}$, each with the best parameters. The selection criterion is the parameter combination leading to the lowest $MSE_{val}$. Real datasets: ANES96, CANCER, DIABETES, MODECHOICE and WINE.
Figure 4: Learning curves averaged over 5-fold cross-validation for the Synthetic Dataset group. Training set on the left, validation set on the right. Epochs on the x-axis and MSE on the y-axis. The curves show: $STD$, $STD+L_2$, $STD+DO$, $DL_{RND}$ and $DL_{NN}$, each with the best parameters. The selection criterion is the parameter combination leading to the lowest $MSE_{val}$. Synthetic datasets: F1, REGRESSION1 and 10, SPARSE UNCORR and SWISS ROLL.
Figure 5: Histograms of the pairwise differences in $MSE_{val}$ between the $STD$ groups ($STD$, $STD+L_2$, $STD+DO$) and the $DL$ groups ($DL_{RND}$, $DL_{NN}$) for each dataset and cross-validation fold. The frequency reported is for each of the 5 folds belonging to the 5 real datasets (top 2 rows) and the 5 synthetic datasets (bottom 2 rows), for a total of 25 samples for each pairwise comparison. $Median_{\Delta}$ is the median of the differences in each comparison. We plot the $Median_{\Delta}$ as a dotted black line and the 0 line in red as a reference point. For the synthetic data, the $Median_{\Delta}$ is so close to 0 that the two lines are not visually distinguishable.
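The quantities named in the Figure 1 caption (midpoint $\mathbf{x}_m$, difference vector $\mathbf{v}$, model derivative and data derivative) can be combined into a single training objective. The sketch below is one possible reading, not the authors' implementation: the central finite difference used for the model's directional derivative, the squared penalty on the derivative mismatch, and the names `directional_derivative`, `dloss_objective`, `theta` and `eps` are all assumptions introduced for illustration.

```python
import numpy as np

def directional_derivative(f, x_m, v, eps=1e-4):
    """Approximate the derivative of f at x_m along the unit direction v/||v||.

    Uses a central finite difference (an assumption; an autodiff gradient
    projected onto the direction would serve the same purpose).
    """
    u = v / np.linalg.norm(v, axis=1, keepdims=True)
    return (f(x_m + eps * u) - f(x_m - eps * u)) / (2.0 * eps)

def dloss_objective(f, X, y, i, j, theta=0.1):
    """Return (total, mse, dloss) with total = mse + theta * dloss.

    f     : callable mapping an (n, d) array to an (n,) array of predictions
    i, j  : index arrays defining the tuples ((x_i, y_i), (x_j, y_j))
    theta : hypothetical weight of the derivative term
    """
    v = X[j] - X[i]                     # difference vectors
    x_m = 0.5 * (X[i] + X[j])           # midpoints
    data_deriv = (y[j] - y[i]) / np.linalg.norm(v, axis=1)
    model_deriv = directional_derivative(f, x_m, v)
    mse = np.mean((f(X) - y) ** 2)
    dloss = np.mean((model_deriv - data_deriv) ** 2)
    return mse + theta * dloss, mse, dloss
```

For a model that reproduces a linear target exactly, both terms vanish up to floating-point error, which gives a quick sanity check before plugging the objective into an optimiser.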
