Tensor Network Based Feature Learning Model

Albert Saiapin, Kim Batselier

TL;DR

The paper tackles kernel methods' scalability by introducing a Feature Learning (FL) model that uses a CPD-parametrized tensor-product feature map and learns feature weights alongside model weights via ALS, thereby avoiding cross-validation for hyperparameters. By representing both the weights and the feature map with CPD and employing a rank-P CPD for the feature map, the approach yields linear-in-D storage and scalable training, with quantized Fourier features further reducing memory. Empirical results on small and large-scale datasets show FL trains 3-5x faster than cross-validated CPD kernel machines while maintaining comparable or better prediction accuracy, demonstrating practical impact for large-scale tensorized kernel learning. The work contributes a hyperparameter-aware, tensor-network framework for scalable, kernel-like learning with avenues for parallelization and probabilistic extensions.

Abstract

Many approximations have been suggested to circumvent the cubic complexity of kernel-based algorithms, allowing their application to large-scale datasets. One strategy is to consider the primal formulation of the learning problem by mapping the data to a higher-dimensional space using tensor-product structured polynomial and Fourier features. The curse of dimensionality due to these tensor-product features was effectively solved by a tensor network reparameterization of the model parameters. However, another important aspect of model training, identifying optimal feature hyperparameters, has not been addressed and is typically handled using the standard cross-validation approach. In this paper, we introduce the Feature Learning (FL) model, which addresses this issue by representing tensor-product features as a learnable Canonical Polyadic Decomposition (CPD). By leveraging this CPD structure, we efficiently learn the hyperparameters associated with different features alongside the model parameters using an Alternating Least Squares (ALS) optimization method. We demonstrate the effectiveness of the FL model through experiments on real data of varying dimensionality and scale. The results show that the FL model can be consistently trained 3-5 times faster than a standard cross-validated model while achieving prediction quality on par with it.

Paper Structure

This paper contains 17 sections, 1 theorem, 21 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2.1

Suppose $\operatorname{ten}\left(\boldsymbol{w}, I_1, I_2, \dots, I_D\right)$ is represented by a CPD. Then the model responses and their gradients can be computed in $\mathcal{O}(DIR)$ instead of $\mathcal{O}(\prod_{d=1}^D I_d)$, where $I = \max(I_1, \dots, I_D)$.
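
As a hedged illustration of Theorem 2.1, the NumPy sketch below evaluates a model response of the form $f(\boldsymbol{x}) = \langle \boldsymbol{w}, \boldsymbol{\varphi}^{(1)}(x_1) \otimes \dots \otimes \boldsymbol{\varphi}^{(D)}(x_D) \rangle$ directly from CPD factor matrices, without ever forming $\boldsymbol{w}$. The function names, the restriction to real-valued features, and the reading of $R$ as the CPD rank are assumptions made for this example, not details confirmed by the summary above. `dense_response` is included only to check the result on a tiny problem.

```python
import numpy as np
from functools import reduce


def cpd_response(factors, features):
    """Response computed from CPD factors in O(D * I * R).

    factors[d]  : array of shape (I_d, R), the d-th CPD factor of w (assumed real)
    features[d] : array of shape (I_d,), the d-th feature vector (assumed real)
    """
    acc = np.ones(factors[0].shape[1])
    for W, phi in zip(factors, features):
        acc = acc * (phi @ W)  # one (R,) vector per dimension
    return acc.sum()


def cpd_response_grad(factors, features, d):
    """Gradient of the response with respect to factors[d], same shape as factors[d]."""
    others = np.ones(factors[0].shape[1])
    for k, (W, phi) in enumerate(zip(factors, features)):
        if k != d:
            others = others * (phi @ W)
    return np.outer(features[d], others)


def dense_response(factors, features):
    """Reference computation that builds the full weight vector (exponential in D)."""
    R = factors[0].shape[1]
    w = sum(reduce(np.kron, [W[:, r] for W in factors]) for r in range(R))
    return w @ reduce(np.kron, features)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dims, R = [3, 4, 2], 5
    factors = [rng.standard_normal((I, R)) for I in dims]
    features = [rng.standard_normal(I) for I in dims]
    print(cpd_response(factors, features), dense_response(factors, features))
```

The sketch relies on the identity $(\boldsymbol{a} \otimes \boldsymbol{b})^\top (\boldsymbol{c} \otimes \boldsymbol{d}) = (\boldsymbol{a}^\top \boldsymbol{c})(\boldsymbol{b}^\top \boldsymbol{d})$, which turns one inner product of length $\prod_d I_d$ into $D$ small matrix-vector products. Because the response is linear in each factor matrix when the others are held fixed, each per-factor update reduces to a linear least-squares problem, which is the property that ALS-type methods exploit.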

Figures (2)

  • Figure 1: Tensorized CPD Kernel Machine (Figure \ref{fig:basic_model}) and FL Model (Figure \ref{fig:FL_model}). In the diagrams, each circle without a cross represents a vector or a matrix (defined by the number of outgoing solid lines, one or two respectively); a crossed circle depicts a three-dimensional tensor containing a particular matrix in the diagonal slice; blue color represents parameters related to non-linear features $\boldsymbol{\psi}_{\theta}^{(d)}(x_d)$, $d = \overline{1, D}$; green color represents model parameters $\boldsymbol{w}$ in a CPD format; a solid line denotes a summation along the corresponding index, while a dashed line denotes a Kronecker product [TN1_Cichocki_2016]; $\boldsymbol{\Psi}^{(d)} = [\boldsymbol{\psi}_{\theta_1}^{(d)}(x_d), \dots, \boldsymbol{\psi}_{\theta_P}^{(d)}(x_d)] \in \mathbb{C}^{I_d \times P}$. Figure \ref{fig:basic_model} depicts model \ref{intro_simple_model} with a rank-1 CPD feature map, while Figure \ref{fig:FL_model} represents our FL model \ref{fl_model} with a rank-$P$ feature map.
  • Figure 2: Plots of the training time (first row) and test MSE (second row) of FL and CV models (orange and blue curves respectively) as a function of the number of features $P$ for different real-life datasets (column-wise). Solid lines represent the mean of each metric and shaded regions depict $\pm 1$ standard deviation around the mean across 10 restarts. The proposed FL model consistently requires less training time than conventional cross-validation. Likewise, the prediction error of the FL model is either similar to that of CV (shaded regions intersect) or significantly lower (Yacht data), which demonstrates the superiority of the FL model.
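
The rank-$P$ feature map described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the feature map is a weighted sum of $P$ rank-1 Kronecker terms, with hypothetical feature weights `lam` (one per term), real-valued features, and hypothetical function names. The matrices `Psi[d]` play the role of $\boldsymbol{\Psi}^{(d)}$ evaluated at a single input.

```python
import numpy as np
from functools import reduce


def term_contributions(factors, Psi):
    """Per-term responses c_p = sum_r prod_d (w_r^(d) . psi_p^(d)), shape (P,).

    factors[d] : (I_d, R) CPD factor of the model weights (assumed real)
    Psi[d]     : (I_d, P) matrix whose p-th column is the d-th feature vector
                 under the p-th hyperparameter setting (assumed real)
    """
    G = np.ones((factors[0].shape[1], Psi[0].shape[1]))  # (R, P)
    for W, Ps in zip(factors, Psi):
        G = G * (W.T @ Ps)
    return G.sum(axis=0)


def fl_response(factors, Psi, lam):
    """Response under the assumed rank-P feature map; linear in lam."""
    return term_contributions(factors, Psi) @ lam


def dense_fl_response(factors, Psi, lam):
    """Reference computation with explicit Kronecker products (tiny problems only)."""
    R, P = factors[0].shape[1], Psi[0].shape[1]
    w = sum(reduce(np.kron, [W[:, r] for W in factors]) for r in range(R))
    phi = sum(lam[p] * reduce(np.kron, [Ps[:, p] for Ps in Psi]) for p in range(P))
    return w @ phi


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    dims, R, P = [3, 2, 4], 3, 2
    factors = [rng.standard_normal((I, R)) for I in dims]
    Psi = [rng.standard_normal((I, P)) for I in dims]
    lam = rng.standard_normal(P)
    print(fl_response(factors, Psi, lam), dense_fl_response(factors, Psi, lam))
```

Under these assumptions the response costs $\mathcal{O}(DIRP)$ per input and is linear in `lam` once the model factors are fixed, so the feature weights could in principle be updated by the same kind of least-squares step used for the model factors.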

Theorems & Definitions (5)

  • Definition 2.1: Vectorization
  • Definition 2.2: Tensorization
  • Definition 2.3: Canonical Polyadic Decomposition [TD_Kolda_2009]
  • Theorem 2.1: CPD Kernel Machine [QTNM_Wesel_2024]
  • Definition 2.4: Fourier Features [QTNM_Wesel_2024]
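
The first three definitions listed above can be made concrete with a small NumPy sketch. The column-major (Fortran-order) index convention and the function names are assumptions chosen for this example; the paper's exact convention is not shown in this summary.

```python
import numpy as np


def vec(T):
    """Vectorization: stack the entries of a tensor into one vector (column-major, assumed)."""
    return T.reshape(-1, order="F")


def ten(v, *dims):
    """Tensorization: reshape a vector into a tensor of size I_1 x ... x I_D (inverse of vec)."""
    return v.reshape(dims, order="F")


def cpd_to_tensor(factors):
    """Rebuild the full tensor from CPD factors as a sum of R rank-1 outer products.

    factors[d] has shape (I_d, R).
    """
    R = factors[0].shape[1]
    T = np.zeros(tuple(W.shape[0] for W in factors))
    for r in range(R):
        comp = factors[0][:, r]
        for W in factors[1:]:
            comp = np.multiply.outer(comp, W[:, r])
        T = T + comp
    return T


def cpd_storage(factors):
    """Number of stored parameters in CPD form: R * (I_1 + ... + I_D)."""
    return sum(W.size for W in factors)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    factors = [rng.standard_normal((I, 2)) for I in (2, 3, 4)]
    full = cpd_to_tensor(factors)
    print(full.shape, full.size, cpd_storage(factors))
```

The CPD stores $R \sum_d I_d$ numbers, which grows linearly in $D$, whereas the full tensor has $\prod_d I_d$ entries. With the column-major convention used here, vectorizing a rank-1 term $\boldsymbol{a} \circ \boldsymbol{b} \circ \boldsymbol{c}$ gives $\boldsymbol{c} \otimes \boldsymbol{b} \otimes \boldsymbol{a}$.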