Anchor regression: heterogeneous data meets causality
Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, Jonas Peters
TL;DR
Anchor regression targets prediction that generalizes under distributional shift by using exogenous anchor variables to regularize the least-squares loss. Its solution path runs continuously from partialling out the anchors, through ordinary least squares, to two-stage least squares, and it has a minimax interpretation: the penalized criterion equals the worst-case risk over a class of shift interventions. The approach yields distributional robustness and improved replicability of variable selection, even when instrumental-variable assumptions fail, and it comes with finite-sample bounds in high dimensions. Empirical results on GTEx and bike-sharing data illustrate improved stability and predictive reliability across heterogeneous domains. The work also gives practical guidance on anchor choice and parameter tuning, and discusses extensions to nonlinear models and other perturbation types.
Abstract
We consider the problem of predicting a response variable from a set of covariates on a data set that differs in distribution from the training data. Causal parameters are optimal in terms of predictive accuracy if in the new distribution either many variables are affected by interventions or only some variables are affected, but the perturbations are strong. If the training and test distributions differ by a shift, causal parameters might be too conservative to perform well on the above task. This motivates anchor regression, a method that makes use of exogenous variables to solve a relaxation of the causal minimax problem by considering a modification of the least-squares loss. The procedure naturally provides an interpolation between the solutions of ordinary least squares and two-stage least squares. We prove that the estimator satisfies predictive guarantees in terms of distributional robustness against shifts in a linear class; these guarantees are valid even if the instrumental variable assumptions are violated. If anchor regression and least squares provide the same answer (anchor stability), we establish that OLS parameters are invariant under certain distributional changes. Anchor regression is shown empirically to improve replicability and protect against distributional shifts.
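The interpolation described above can be illustrated with a short numerical sketch. It assumes the penalized least-squares form of the estimator, which is not spelled out in this abstract: minimize ||(I - P_A)(Y - Xb)||^2 + gamma * ||P_A (Y - Xb)||^2, where P_A projects onto the span of the anchors A. Under that assumption the solution is plain least squares on data transformed by I - (1 - sqrt(gamma)) P_A, so gamma = 0 partials out the anchors, gamma = 1 gives OLS, and large gamma approaches two-stage least squares. The function names, the simulated data, and the coefficient values below are hypothetical and chosen only for illustration.

```python
import numpy as np


def project_onto_anchors(A, V):
    """Return P_A V, the least-squares projection of V onto the column space of A."""
    coef, *_ = np.linalg.lstsq(A, V, rcond=None)
    return A @ coef


def anchor_regression(X, Y, A, gamma):
    """Sketch of anchor regression via OLS on transformed data.

    Assumes the objective ||(I - P_A)(Y - Xb)||^2 + gamma * ||P_A (Y - Xb)||^2.
    All variables are centered, so no intercept is fitted.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean()
    A = A - A.mean(axis=0)
    s = np.sqrt(gamma)
    # Apply W = I - (1 - sqrt(gamma)) P_A to covariates and response.
    X_t = X - (1.0 - s) * project_onto_anchors(A, X)
    Y_t = Y - (1.0 - s) * project_onto_anchors(A, Y)
    coef, *_ = np.linalg.lstsq(X_t, Y_t, rcond=None)
    return coef


# Hypothetical simulated data: anchors A shift X, and a hidden confounder H
# affects both X and Y, so OLS and the large-gamma solution differ.
rng = np.random.default_rng(0)
n = 5000
A = rng.normal(size=(n, 2))
H = rng.normal(size=n)
X = A @ np.array([[1.0, 0.5], [-0.5, 1.0]]) + H[:, None] + rng.normal(size=(n, 2))
Y = X @ np.array([1.0, -2.0]) + 2.0 * H + rng.normal(size=n)

for gamma in (0.0, 1.0, 10.0, 1e6):
    print(f"gamma={gamma:g}:", anchor_regression(X, Y, A, gamma))
```

In this simulation the anchors happen to be valid instruments, so the large-gamma coefficients should move toward the structural values used to generate Y. In practice gamma is a tuning parameter that trades in-sample fit against protection from larger shifts.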
