LIT-LVM: Structured Regularization for Interaction Terms in Linear Predictors using Latent Variable Models

Mohammadreza Nemati, Zhipeng Huang, Kevin S. Xu

TL;DR

LIT-LVM addresses the challenge of estimating interaction-term coefficients in linear predictors when the interaction matrix $\mathbf{\Theta}$ is large and potentially noisy. By imposing an approximate low-dimensional structure through latent representations $\mathbf{Z}$ and flexible models (low rank or latent distance), it jointly estimates model parameters and latent positions via a total loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{pred}} + \lambda_r \mathcal{L}_{\text{reg}} + \lambda_l \mathcal{L}_{\text{lvm}}$, with $\mathcal{L}_{\text{lvm}} = \|\boldsymbol{\epsilon}\|_F^2$ and updates performed by Adam plus proximal steps. The approach substantially improves predictive accuracy over elastic net with interactions, hierarchical lasso, and factorization machines across simulations, OpenML benchmarks, and a kidney transplant survival analysis, while yielding interpretable latent representations of features that capture donor-recipient compatibility. The method balances flexibility with structure, allowing exact low-rank methods to be approached as $\lambda_l\rightarrow\infty$ and providing robustness when the low-rank assumption is only approximate. These results suggest significant practical impact for high-$p$ problems where interaction terms are informative but difficult to estimate reliably, particularly in biomedical and recommender-type settings where latent structure among features can be exploited for both accuracy and interpretability.
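
As a rough illustration of how the three terms in $\mathcal{L}_{\text{total}}$ fit together, the following minimal NumPy sketch evaluates the loss for the low-rank model. It rests on several assumptions not stated above: a squared-error $\mathcal{L}_{\text{pred}}$, an L1 penalty standing in for $\mathcal{L}_{\text{reg}}$, interactions restricted to pairs $j < k$, and $\boldsymbol{\epsilon}$ taken as the deviation between $\mathbf{\Theta}$ and $\mathbf{Z}\mathbf{Z}^T$ on the upper triangle. The function name `lit_lvm_total_loss` and all argument names are hypothetical, and the Adam and proximal updates are not shown.

```python
import numpy as np


def lit_lvm_total_loss(X, y, beta0, beta, Theta, Z, lam_r, lam_l):
    """Toy total loss: L_pred + lam_r * L_reg + lam_l * L_lvm.

    Assumptions (not from the source): squared-error L_pred, an L1 penalty
    as L_reg, and the low-rank model Theta_jk ~ z_j^T z_k for j < k.
    """
    p = X.shape[1]
    iu = np.triu_indices(p, k=1)  # index pairs (j, k) with j < k

    # Linear predictor: intercept + main effects + pairwise interactions.
    interactions = X[:, iu[0]] * X[:, iu[1]]  # shape (n, p * (p - 1) / 2)
    y_hat = beta0 + X @ beta + interactions @ Theta[iu]

    loss_pred = np.mean((y - y_hat) ** 2)
    loss_reg = np.abs(beta).sum() + np.abs(Theta[iu]).sum()

    # epsilon: deviation of Theta from the latent model on the upper triangle.
    eps = np.zeros_like(Theta)
    eps[iu] = Theta[iu] - (Z @ Z.T)[iu]
    loss_lvm = np.sum(eps ** 2)  # squared Frobenius norm of epsilon

    return loss_pred + lam_r * loss_reg + lam_l * loss_lvm


# Toy usage with made-up data: Theta is exactly low rank and y is noise-free.
rng = np.random.default_rng(0)
n, p, d = 50, 4, 2
X = rng.normal(size=(n, p))
Z = rng.normal(size=(p, d))
beta = rng.normal(size=p)
Theta = np.triu(Z @ Z.T, k=1)
iu = np.triu_indices(p, k=1)
y = X @ beta + (X[:, iu[0]] * X[:, iu[1]]) @ Theta[iu]

# With lam_r = 0 the loss is numerically zero, because epsilon vanishes
# and the predictions reproduce y.
loss_exact = lit_lvm_total_loss(X, y, 0.0, beta, Theta, Z, lam_r=0.0, lam_l=10.0)
```

In this sketch, a large `lam_l` penalizes any entry of `Theta` that departs from `Z @ Z.T`, which is one way to read the statement that exact low-rank methods are approached as $\lambda_l\rightarrow\infty$.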

Abstract

Some of the simplest, yet most frequently used predictors in statistics and machine learning use weighted linear combinations of features. Such linear predictors can model non-linear relationships between features by adding interaction terms corresponding to the products of all pairs of features. We consider the problem of accurately estimating coefficients for interaction terms in linear predictors. We hypothesize that the coefficients for different interaction terms have an approximate low-dimensional structure and represent each feature by a latent vector in a low-dimensional space. This low-dimensional representation can be viewed as a structured regularization approach that further mitigates overfitting in high-dimensional settings beyond standard regularizers such as the lasso and elastic net. We demonstrate that our approach, called LIT-LVM, achieves superior prediction accuracy compared to the elastic net, hierarchical lasso, and factorization machines on a wide variety of simulated and real data, particularly when the number of interaction terms is high compared to the number of samples. LIT-LVM also provides low-dimensional latent representations for features that are useful for visualizing and analyzing their relationships.
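
To make the phrase "products of all pairs of features" concrete, here is a small sketch in plain Python. It assumes that only distinct pairs $j < k$ are used (no squared terms), which the abstract does not specify; the helper name `add_pairwise_interactions` is hypothetical.

```python
from itertools import combinations
from math import comb


def add_pairwise_interactions(x):
    """Return x followed by the products x_j * x_k for all pairs j < k."""
    pairs = combinations(range(len(x)), 2)
    return list(x) + [x[j] * x[k] for j, k in pairs]


# 4 original features plus 4 * 3 / 2 = 6 interaction terms.
features = add_pairwise_interactions([1.0, 2.0, 3.0, 4.0])

# The number of interaction terms grows quadratically with p.
n_interactions_p100 = comb(100, 2)  # 4950 interaction terms for p = 100
```

The quadratic growth is why the number of interaction coefficients can easily exceed the number of samples, which is the regime the abstract highlights.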

Paper Structure

This paper contains 54 sections, 8 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Our proposed framework for linear predictors with interaction terms. We arrange the coefficient vector for the interaction terms $\boldsymbol{\theta}_{\text{flat}}$ into an upper triangular matrix $\boldsymbol{\Theta}$. Our main contribution is to impose an approximate low-dimensional structure on $\boldsymbol{\Theta}$ using structured regularization. We represent each of the $p$ features by a latent vector $\boldsymbol{z}_j$ in $d$ dimensions, with $d < p$, and minimize the deviation between $\theta_{jk}$ and a function of $\boldsymbol{z}_j$ and $\boldsymbol{z}_k$ such as the dot product $\boldsymbol{z}_j^T \boldsymbol{z}_k$. In this toy example, $p=4$ and $d=2$.
  • Figure 2: Results of simulation experiments on linear and logistic regression where the true latent dimension for the interaction weights $\boldsymbol{\Theta}$ is $d_{\text{true}}=2$. Lower RMSE (linear regression) and higher AUC (logistic regression) denote higher prediction accuracy. Error bars denote one standard error. Our proposed LIT-LVM approach, which uses structured regularization, significantly outperforms the elastic net with interaction terms, a traditional regularization approach, as $p$ increases for fixed $n$.
  • Figure 3: Mean AUC for different values of the LVM regularization strength $\lambda_l$ on simulated data experiments using logistic regression. In the low-noise simulation ($\sigma^{2}_{\theta}=0.1$), AUC peaks at an intermediate $\lambda_l$, confirming the benefit of the LVM regularization term. In the high-noise simulation ($\sigma^{2}_{\theta}=4$), the noise has destroyed the latent structure, so the LVM regularization provides minimal benefit.
  • Figure 4: Mean AUC for different values of the LVM regularization strength $\lambda_l$ on real OpenML classification datasets. On the clean1 dataset, AUC peaks at a moderate $\lambda_l$ between $0.01$ and $0.1$, while the more difficult fri_c4_500_100 dataset benefits from stronger LVM regularization with $\lambda_l$ in the range of $1$ to $100$.
  • Figure 5: Latent space representation of HLA-DR pairs. Green and magenta dashed lines denote HLA pairs associated with the best (most compatible) and worst (least compatible) outcomes, respectively. In the LIT-LVM latent space, compatible HLA pairs are positioned closely and incompatible pairs are widely separated, whereas PCA fails to preserve this property.
  • ...and 13 more figures
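
The arrangement described in the Figure 1 caption, for the toy case $p=4$ and $d=2$, can be sketched as follows. This is only an illustration: the coefficient values and latent vectors are made up, and both the strictly upper triangular layout (no diagonal) and the row-major ordering of the flat vector are assumptions not confirmed by the caption.

```python
import numpy as np

p, d = 4, 2

# Made-up interaction coefficients, one per pair (j, k) with j < k.
# Assumed row-major order: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
theta_flat = np.array([0.5, -1.0, 0.2, 0.3, 0.0, 1.5])

# Arrange the flat vector into an upper triangular matrix Theta.
Theta = np.zeros((p, p))
Theta[np.triu_indices(p, k=1)] = theta_flat

# Hypothetical latent vectors z_j, one row per feature, in d dimensions.
Z = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [-1.0, 0.5],
              [0.0, 1.0]])

# Deviation between theta_jk and the dot product z_j^T z_k for two pairs.
deviation_01 = Theta[0, 1] - Z[0] @ Z[1]  # 0.5 - 0.5 = 0.0
deviation_12 = Theta[1, 2] - Z[1] @ Z[2]  # 0.3 - 0.0 = 0.3
```

The structured regularization described in the caption would shrink such deviations toward zero rather than force them to vanish, which is what makes the low-dimensional structure approximate.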