SoftStep: Learning Sparse Similarity Powers Deep Neighbor-Based Regression

Aviad Susman, Baihan Lin, Mayte Suárez-Fariñas, Joseph T Colonel

TL;DR

The paper addresses the underutilization of neighbor-based regression in deep learning by introducing SoftStep, a differentiable module that learns sparse instance-wise similarities to power nonlinear, neighbor-based regression heads. It provides theoretical links showing that mean squared error on neighbor-based predictions induces structuring constraints on embedding spaces (pairwise and triplet relationships) and demonstrates through extensive experiments that SoftStep-enhanced heads outperform traditional linear predictors across varied architectures and unstructured data domains. By unifying neighbor-based regression with differentiable similarity warping, the work highlights a bridge to sparse attention and representational alignment, offering a plug-in approach that improves expressiveness without sacrificing end-to-end training. The findings suggest SoftStep as a general mechanism for adaptive, sparse similarity in deep networks, with potential applications in attention, metric learning, and representational analysis beyond regression.

Abstract

Neighbor-based methods are a natural alternative to linear prediction for tabular data when relationships between inputs and targets exhibit complexity such as nonlinearity, periodicity, or heteroscedasticity. Yet in deep learning on unstructured data, nonparametric neighbor-based approaches are rarely implemented in lieu of simple linear heads. This is primarily due to the ability of systems equipped with linear regression heads to co-learn internal representations along with the linear head's parameters. To unlock the full potential of neighbor-based methods in neural networks we introduce SoftStep, a parametric module that learns sparse instance-wise similarity measures directly from data. When integrated with existing neighbor-based methods, SoftStep enables regression models that consistently outperform linear heads across diverse architectures, domains, and training scenarios. We focus on regression tasks, where we show theoretically that neighbor-based prediction with a mean squared error objective constitutes a metric learning algorithm that induces well-structured embedding spaces. We then demonstrate analytically and empirically that this representational structure translates into superior performance when combined with the sparse, instance-wise similarity measures introduced by SoftStep. Beyond regression, SoftStep is a general method for learning instance-wise similarity in deep neural networks, with broad applicability to attention mechanisms, metric learning, representational alignment, and related paradigms.
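The claim that neighbor-based prediction under a mean squared error objective acts as a metric learner can be illustrated with a small sketch. This is an illustrative assumption, not the authors' implementation: it uses an RBF similarity with a hypothetical bandwidth `gamma`, predicts each target as a similarity-weighted mean of the other points' targets (leave-one-out), and compares the loss for two toy embeddings. All function and variable names are invented for this example.

```python
import math


def rbf_similarity(a, b, gamma=1.0):
    """RBF similarity between two embeddings (gamma is an assumed bandwidth)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def neighbor_predictions(embeddings, targets, gamma=1.0):
    """Predict each target as a similarity-weighted mean of the OTHER targets."""
    preds = []
    for i, z_i in enumerate(embeddings):
        # A point never votes for itself (leave-one-out weighting).
        weights = [
            0.0 if j == i else rbf_similarity(z_i, z_j, gamma)
            for j, z_j in enumerate(embeddings)
        ]
        total = sum(weights)
        preds.append(sum(w * y for w, y in zip(weights, targets)) / total)
    return preds


def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)


targets = [1.0, 1.0, 5.0, 5.0]

# Embedding A: points with equal targets are close together.
aligned = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]]
# Embedding B: each point's nearest neighbor has a different target.
misaligned = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [2.1, 2.0]]

preds_aligned = neighbor_predictions(aligned, targets)
loss_aligned = mse(preds_aligned, targets)
loss_misaligned = mse(neighbor_predictions(misaligned, targets), targets)
```

In this toy setting the loss is small only when points with similar targets sit close together in the embedding space, so minimizing it through an upstream network would push the embedding toward that arrangement.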

Paper Structure

This paper contains 35 sections, 12 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: The architecture of the neural network regressors tested in this paper. The left side shows the head-to-head comparison between linear heads and neighbor-based heads evaluated in this work. The right side shows the specifics of neighbor-based regression augmented with SoftStep. Here, $\sim$ refers to a soft ranking operation in the context of differentiable $k$-NN and a row-wise min-max normalization of the similarity matrix in the context of NCA.
  • Figure 2: The SoftStep family of functions learns sparse attention smoothly and differentiably. SoftStep is a family of increasing surjective functions mapping the unit interval to itself. Plotted here is the SoftStep function as the transition parameter $t$ varies from 0 to 1. Parameters $l$ and $u$ define the transition boundaries.
  • Figure 3: NCA and SoftStep enhance regression by modeling nonlinear relationships, allowing learned manifolds to better reflect underlying data distributions. Synthetic datasets were generated by sampling points from the domain $[-1,1]^2$ and assigning continuous labels using various target functions. The color gradients depict the target values and the resulting regression surfaces learned by a linear regression head, NCA, and NCA augmented with SoftStep. The linear head cannot capture the complexity of the nonlinear targets. Conversely, NCA, especially when augmented with SoftStep, learns smooth nonlinear regression surfaces that capture the underlying distribution. This capability relaxes constraints on upstream neural networks when learning predictive representations.
  • Figure 4: We plot the total floating point operations (FLOPs) for each method tested as we increase the batch size/number of neighbors ($N$) and the embedding dimension ($d$) exponentially. For each method, we attach a linear layer that condenses the data from $4\times d$ to $d$ before regression. We run 10 forward and backward passes on NVIDIA Tesla V100 PCIE 16GB GPUs and report the average FLOPs. When varying $N$, we fix $d=25$ to match our experiments. Similarly, when varying the embedding dimension, we fix $N=32$. We plot $\log_2(\text{FLOP})$ to account for the exponential scale of the x-axes. Since SoftStep parameter generation and similarity warping are each $O(dN)$, they do not change the overall $O(dN^2)$ complexity of our neighbor-based methods, which matches that of other self-attention-style mechanisms (Vaswani et al., 2017). Our observations are consistent with Blondel et al. (2020).
  • Figure 5: The x-axis ticks are hyperparameter combination triplets of the form (similarity measure, embedding dimension, batch size). The best-performing hyperparameter configuration by mean performance was (RBF, 25, 32).
  • ...and 2 more figures
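
The captions for Figures 1 and 2 describe a row-wise min-max normalization of similarities followed by an increasing, surjective warp of the unit interval with transition boundaries $l$ and $u$. The sketch below shows how such a pipeline can yield sparse neighbor weights. It is a stand-in under stated assumptions: the SoftStep formula itself is not given in this summary, so a cubic smoothstep (`smoothstep_warp`, a hypothetical name) is substituted for it, the transition parameter $t$ is omitted, and `l` and `u` are fixed constants rather than learned per instance.

```python
def min_max_normalize(row):
    """Rescale one row of similarities to the unit interval."""
    lo, hi = min(row), max(row)
    if hi == lo:
        # Degenerate row: every neighbor is equally similar.
        return [1.0 for _ in row]
    return [(s - lo) / (hi - lo) for s in row]


def smoothstep_warp(s, l, u):
    """Stand-in warp (NOT the paper's SoftStep): 0 below l, 1 above u,
    and a cubic smoothstep in between, so the map is non-decreasing
    and covers the whole unit interval."""
    if s <= l:
        return 0.0
    if s >= u:
        return 1.0
    x = (s - l) / (u - l)
    return x * x * (3.0 - 2.0 * x)


# One row of a similarity matrix: one query against four neighbors.
row = [0.05, 0.20, 0.60, 0.90]

normalized = min_max_normalize(row)
warped = [smoothstep_warp(s, l=0.3, u=0.8) for s in normalized]

# Renormalize so the surviving neighbors' weights sum to one.
total = sum(warped)
weights = [w / total for w in warped]
```

With these assumed boundaries, the two least similar neighbors receive exactly zero weight while the warp stays continuous in between, which is the kind of smooth sparsification the Figure 2 caption describes.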