Meta-Learning with Generalized Ridge Regression: High-dimensional Asymptotics, Optimality and Hyper-covariance Estimation

Yanhao Jin, Krishnakumar Balasubramanian, Debashis Paul

TL;DR

This work analyzes meta-learning under a high-dimensional multivariate random-effects model, showing how generalized ridge regression with a weight tied to the hyper-covariance $\Omega$ can improve generalization to unseen tasks. It establishes precise high-dimensional limits for predictive risk, proves optimality of using $\Omega^{-1}$ as the ridge weight, and develops a scalable geodesically convex method-of-moments estimator for $\Omega$ (with extensions to sparse settings). The proposed framework leverages random matrix theory to characterize the limiting risk and uses Riemannian optimization to efficiently estimate $\Omega$ without relying on non-convex MLE. Numerical experiments confirm the theoretical gains, demonstrating improved predictive performance on new tasks, particularly when hyper-covariance structure is accurately estimated or suitably regularized. The results provide a principled approach to meta-learning that exploits task similarities while remaining computationally feasible in high dimensions.

Abstract

Meta-learning involves training models on a variety of training tasks in a way that enables them to generalize well on new, unseen test tasks. In this work, we consider meta-learning within the framework of high-dimensional multivariate random-effects linear models and study generalized ridge-regression based predictions. The statistical intuition of using generalized ridge regression in this setting is that the covariance structure of the random regression coefficients can be leveraged to make better predictions on new tasks. Accordingly, we first characterize the precise asymptotic behavior of the predictive risk for a new test task when the data dimension grows proportionally to the number of samples per task. We next show that this predictive risk is optimal when the weight matrix in generalized ridge regression is chosen to be the inverse of the covariance matrix of random coefficients. Finally, we propose and analyze an estimator of the inverse covariance matrix of random regression coefficients based on data from the training tasks. As opposed to intractable MLE-type estimators, the proposed estimators can be computed efficiently, as they are obtained by solving (global) geodesically-convex optimization problems. Our analysis and methodology use tools from random matrix theory and Riemannian optimization. Simulation results demonstrate the improved generalization performance of the proposed method on new unseen test tasks within the considered framework.
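
The intuition in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the penalty normalization $n\lambda A$, Gaussian task coefficients $\beta \sim N(0,\Omega)$, isotropic Gaussian features, the block-diagonal choice of $\Omega$, and the helper names `generalized_ridge` and `simulate_risk`. The sketch compares the weight $A=\Omega^{-1}$ with the ordinary ridge weight $A=I$ on simulated tasks.

```python
import numpy as np

def generalized_ridge(X, y, A, lam):
    """Minimize ||y - X b||^2 + n * lam * b^T A b (normalization assumed here)."""
    n = X.shape[0]
    return np.linalg.solve(X.T @ X + n * lam * A, X.T @ y)

def simulate_risk(A, Omega, lam, n=50, sigma=1.0, n_tasks=200, seed=0):
    """Monte Carlo estimate of E||b_hat - beta||^2 over simulated tasks.

    With isotropic features this equals the excess prediction error
    E[(x^T (b_hat - beta))^2] on a fresh test point x.
    """
    p = Omega.shape[0]
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(Omega)
    errors = []
    for _ in range(n_tasks):
        beta = chol @ rng.standard_normal(p)          # task coefficients ~ N(0, Omega)
        X = rng.standard_normal((n, p))               # isotropic features (assumption)
        y = X @ beta + sigma * rng.standard_normal(n)
        b_hat = generalized_ridge(X, y, A, lam)
        errors.append(np.sum((b_hat - beta) ** 2))
    return float(np.mean(errors))

# Hypothetical hyper-covariance: half the coordinates carry most of the signal.
p, n, sigma = 100, 50, 1.0
Omega = np.diag(np.r_[np.full(p // 2, 4.0), np.full(p // 2, 0.01)])

lam = sigma**2 / n                                    # matches the Gaussian posterior mean
risk_oracle = simulate_risk(np.linalg.inv(Omega), Omega, lam, n=n, sigma=sigma)

# Ordinary ridge, tuned as if all coordinates had the average variance tr(Omega)/p.
lam_iso = lam * p / np.trace(Omega)
risk_identity = simulate_risk(np.eye(p), Omega, lam_iso, n=n, sigma=sigma)

print(f"risk with A = Omega^-1: {risk_oracle:.2f}")
print(f"risk with A = I:        {risk_identity:.2f}")
```

Under the Gaussian model assumed in this sketch, the choice $A=\Omega^{-1}$ with $\lambda=\sigma^2/n$ coincides with the posterior mean of $\beta$, which is why it should not do worse on average than the identity-weighted fit. Both calls reuse the same seed, so the two estimators are compared on identical simulated tasks.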

Paper Structure

This paper contains 38 sections, 24 theorems, 387 equations, 3 figures, 12 tables, 2 algorithms.

Key Result

Theorem 2.1

This theorem gives the predictive risk of generalized ridge regression on the new task indexed by $L+1$ under two estimators: the oracle estimator $\tilde{\beta}_{\lambda}^{(L+1)}$ and the estimator $\hat{\beta}_{\lambda}^{(L+1)}$ from (estimator_generalized_ridge).

Figures (3)

  • Figure 1: Plot of limiting risk in (expression_risk_MPLaw). Here, $\Sigma^{(L+1)}=\varrho\Omega^{-1}$ (for various choices of $a$), $\sigma^2=1.5$ and $\gamma_{L+1}$ takes values in the set $\{1.5,2,3,5,10\}$.
  • Figure 2: Plot of limiting risk when $\Sigma^{(L+1)}=\Omega^{-\kappa}$.
  • Figure 3: Risk $R_{c\lambda^{*}}(\hat{\Omega})$ for different values of $c$.

Theorems & Definitions (57)

  • Definition 2.1: Empirical and limiting spectral distribution (ESD)
  • Example 2.1
  • Theorem 2.1
  • Remark 2.1
  • Definition 2.2: Stieltjes transform
  • Theorem 2.2: Marčenko and Pastur (1967)
  • Theorem 2.3
  • Remark 2.2
  • Remark 2.3
  • Lemma 2.1
  • ...and 47 more
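
The notions named in Definition 2.1, Definition 2.2 and Theorem 2.2 can be illustrated numerically. The sketch below is an assumption-laden example rather than the paper's setup: it uses i.i.d. standard Gaussian data with identity covariance, restricts the closed form to real $z<0$, and the helper names are hypothetical. It computes the eigenvalues behind the empirical spectral distribution of a sample covariance matrix, evaluates the empirical Stieltjes transform $m(z)=\frac{1}{p}\sum_i (\lambda_i - z)^{-1}$, and compares it with the standard Marčenko-Pastur limit for aspect ratio $\gamma=p/n$.

```python
import numpy as np

def sample_covariance_eigenvalues(n, p, seed=0):
    """Eigenvalues of X^T X / n for an n x p standard Gaussian matrix X."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    return np.linalg.eigvalsh(X.T @ X / n)

def empirical_stieltjes(eigs, z):
    """Stieltjes transform of the empirical spectral distribution at z."""
    return np.mean(1.0 / (eigs - z))

def mp_stieltjes_negative_real(z, gamma):
    """Marchenko-Pastur Stieltjes transform (identity covariance) for real z < 0.

    Root of gamma*z*m^2 - (1 - gamma - z)*m + 1 = 0 that is positive for z < 0.
    """
    disc = (1.0 + gamma - z) ** 2 - 4.0 * gamma
    return (1.0 - gamma - z - np.sqrt(disc)) / (2.0 * gamma * z)

n, p = 400, 200
gamma = p / n
eigs = sample_covariance_eigenvalues(n, p)

z = -1.0
m_emp = empirical_stieltjes(eigs, z)
m_mp = mp_stieltjes_negative_real(z, gamma)

# Edges of the Marchenko-Pastur support for identity covariance.
lower, upper = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2

print(f"empirical m({z}) = {m_emp:.4f}, Marchenko-Pastur m({z}) = {m_mp:.4f}")
print(f"eigenvalue range [{eigs.min():.3f}, {eigs.max():.3f}], "
      f"limiting support [{lower:.3f}, {upper:.3f}]")
```

Evaluating at a negative real $z$ keeps everything real-valued; quantities of this kind, resolvent traces at a regularization level, are what limiting risk formulas for ridge-type estimators are typically expressed in.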