Optimal Algorithms in Linear Regression under Covariate Shift: On the Importance of Precondition
Yuanshi Liu, Haihan Zhang, Qian Chen, Cong Fang
TL;DR
This work establishes an information-theoretic minimax optimality result for high-dimensional linear regression under covariate shift, showing that the minimax estimator is a linear preconditioned transform of the source-domain estimator with an efficiently computable preconditioner. It introduces a convex program to select the optimal preconditioner $\boldsymbol{A}$ and proves a matching upper bound, thereby characterizing the minimax rate. Beyond the lower bound, the authors analyze ASGD and SGD with momentum, providing instance-based conditions under which ASGD attains optimality in a broad class of target distributions via a preconditioning interpretation. The study also uncovers emergent phenomena in learning curves under covariate shift and discusses practical implications for designing preconditioned optimization methods in covariate-shift settings.
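To make the phrase "linear preconditioned transform of the source-domain estimator" concrete, the following toy sketch evaluates the target excess risk of a coordinate-wise rescaled source estimator. It is only an illustration under strong assumptions that are not taken from the paper: the source and target covariances are diagonal with eigenvalues `s` and `t`, the source estimator is unbiased with per-coordinate variance `sigma2 / (n * s_i)`, and the ground truth has prior second moments `m`. The function names and the shrinkage rule are hypothetical stand-ins; this is not the convex program the authors use to select $\boldsymbol{A}$.

```python
# Toy sketch: diagonal covariances and an idealized sequence model (assumptions,
# not the paper's general setting or its convex program).

def target_risk(a, s, t, m, sigma2, n):
    """Average target excess risk of the rescaled estimator a_i * w_hat_i.

    a: diagonal preconditioner, s: source eigenvalues, t: target eigenvalues,
    m: prior second moments E[w*_i^2], sigma2: noise variance, n: sample size.
    Assumes w_hat_i = w*_i + noise with zero mean and variance sigma2 / (n * s_i).
    """
    risk = 0.0
    for a_i, s_i, t_i, m_i in zip(a, s, t, m):
        bias_sq = (1.0 - a_i) ** 2 * m_i          # shrinkage bias
        variance = a_i ** 2 * sigma2 / (n * s_i)  # rescaled estimation noise
        risk += t_i * (bias_sq + variance)        # weighted by the target spectrum
    return risk


def shrinkage_preconditioner(s, m, sigma2, n):
    """Coordinate-wise minimizer of bias_sq + variance: a_i = m_i / (m_i + v_i)."""
    return [m_i / (m_i + sigma2 / (n * s_i)) for s_i, m_i in zip(s, m)]


# The target puts most of its mass on the direction the source barely covers.
s = [1.0, 0.01]
t = [0.01, 1.0]
m = [1.0, 1.0]
sigma2, n = 1.0, 100

identity = [1.0, 1.0]
shrunk = shrinkage_preconditioner(s, m, sigma2, n)

risk_identity = target_risk(identity, s, t, m, sigma2, n)
risk_shrunk = target_risk(shrunk, s, t, m, sigma2, n)
print(risk_identity, risk_shrunk)
```

In this toy instance the unmodified source estimator pays for its large variance along the poorly covered direction, while the rescaled estimator trades some bias for a lower target risk.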
Abstract
A common pursuit in modern statistical learning is to attain satisfactory generalization outside the source data distribution (out-of-distribution, OOD). In theory, the challenge remains unsolved even under the canonical setting of covariate shift for the linear model. This paper studies foundational (high-dimensional) linear regression in which the ground-truth variables are confined to an ellipse-shaped constraint, and addresses two fundamental questions in this regime: (i) given the target covariate matrix, what is the min-max \emph{optimal} algorithm under covariate shift? (ii) for what kinds of target classes do the commonly used SGD-type algorithms achieve optimality? Our analysis starts by establishing a tight lower bound on generalization via a Bayesian Cramér-Rao inequality. For (i), we prove that the optimal estimator can simply be a certain linear transformation of the best estimator for the source distribution. Given the source and target matrices, we show that the transformation can be efficiently computed via a convex program. The min-max optimality analysis for SGD rests on viewing both the accumulated updates of the applied algorithms and the ideal transformation as preconditions on the learning variables. We provide sufficient conditions under which SGD and its accelerated variants attain optimality.
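The idea that accumulated updates act as a precondition on the learning variables can be seen in a much simpler setting than the one analyzed in the paper. The sketch below assumes deterministic full-batch gradient descent from a zero initialization on a diagonal quadratic (not SGD, ASGD, or momentum), where the iterate after $k$ steps equals $(1 - (1 - \eta h_i)^k)\, w^*_i$ in each coordinate. The function names and the numbers are illustrative choices, not taken from the paper.

```python
# Simplified illustration: full-batch gradient descent on a diagonal quadratic,
# used here as a stand-in for the SGD-type algorithms discussed in the abstract.

def gradient_descent_diagonal(h, w_star, eta, steps):
    """Run GD from zero on 0.5 * sum_i h_i * (w_i - w_star_i)^2."""
    w = [0.0] * len(h)
    for _ in range(steps):
        # Gradient in coordinate i is h_i * (w_i - w_star_i).
        w = [w_i - eta * h_i * (w_i - ws_i)
             for w_i, h_i, ws_i in zip(w, h, w_star)]
    return w


def implicit_preconditioner(h, eta, steps):
    """Closed-form diagonal factor applied to w_star after `steps` GD updates."""
    return [1.0 - (1.0 - eta * h_i) ** steps for h_i in h]


h = [1.0, 0.1, 0.01]        # curvature (source spectrum) per coordinate
w_star = [1.0, -2.0, 3.0]   # minimizer of the quadratic
eta, steps = 0.5, 20

iterate = gradient_descent_diagonal(h, w_star, eta, steps)
factors = implicit_preconditioner(h, eta, steps)
predicted = [f * ws for f, ws in zip(factors, w_star)]
print(iterate)
print(predicted)
```

The two printed vectors agree up to floating-point error: directions with large curvature are almost fully recovered, while low-curvature directions stay shrunk toward zero, which is the sense in which the step size and iteration count implicitly choose a preconditioner.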
