Invariant Subspace Decomposition

Margherita Lazzaretto; Jonas Peters; Niklas Pfister

TL;DR

Invariant Subspace Decomposition (ISD) addresses non-stationary regression where the conditional distribution of $Y_t$ given $X_t$ evolves over time. It splits the parameter $\gamma_{0,t}$ into a time-invariant component $\beta^{\text{inv}}$ in an invariant subspace $\mathcal{S}^{\text{inv}}$ and a residual time-varying component $\delta^{\text{res}}_t$ in the complementary subspace $\mathcal{S}^{\text{res}}$, enabling zero-shot and time-adaptation prediction. The invariant part is learned from historical data, using joint block diagonalization to identify $\mathcal{S}^{\text{inv}}$, while the residual part is estimated from adaptation data. The resulting finite-sample error bound scales with $\dim(\mathcal{S}^{\text{inv}})/n + \dim(\mathcal{S}^{\text{res}})/m$, where $n$ and $m$ are the sizes of the historical and adaptation data. Theoretical results show ISD can outperform naive OLS and maximin approaches in non-stationary settings, and experiments on synthetic and real data show improved predictive accuracy in both zero-shot and time-adaptation tasks. The work lays groundwork for extending invariance-based time adaptation to nonlinear models and domain-specific applications.

Abstract

We consider the task of predicting a response Y from a set of covariates X in settings where the conditional distribution of Y given X changes over time. For this to be feasible, assumptions on how the conditional distribution changes over time are required. Existing approaches assume, for example, that changes occur smoothly over time so that short-term prediction using only the recent past becomes feasible. To additionally exploit observations further in the past, we propose a novel invariance-based framework for linear conditionals, called Invariant Subspace Decomposition (ISD), that splits the conditional distribution into a time-invariant and a residual time-dependent component. As we show, this decomposition can be utilized both for zero-shot and time-adaptation prediction tasks, that is, settings where either no or a small amount of training data is available at the time points we want to predict Y at, respectively. We propose a practical estimation procedure, which automatically infers the decomposition using tools from approximate joint matrix diagonalization. Furthermore, we provide finite sample guarantees for the proposed estimator and demonstrate empirically that it indeed improves on approaches that do not use the additional invariant structure.
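To make the two prediction tasks concrete, here is a minimal sketch in Python. It is not the authors' estimator: the orthogonal split of the covariate space into an invariant and a residual subspace is treated as known (the paper infers it via approximate joint matrix diagonalization, which is skipped here), and the function name `isd_sketch`, the sample sizes, and the simulated model are all hypothetical choices made for illustration.

```python
import numpy as np

def isd_sketch(n_hist=5000, m_adapt=20, seed=0):
    """Toy illustration: invariant part from history, residual part from a short window.

    Assumption: the orthonormal bases U_inv and U_res are known, not estimated.
    """
    rng = np.random.default_rng(seed)
    p = 3
    U, _ = np.linalg.qr(rng.normal(size=(p, p)))   # random orthonormal basis of R^p
    U_inv, U_res = U[:, :2], U[:, 2:]              # invariant (dim 2) and residual (dim 1)
    beta_inv = U_inv @ np.array([1.0, -0.5])       # time-invariant component
    scales = np.array([1.0, 2.0, 1.5])             # covariance is block diagonal in basis U

    def draw(delta_coef):
        n = len(delta_coef)
        X = (rng.normal(size=(n, p)) * scales) @ U.T
        gamma = beta_inv + np.outer(delta_coef, U_res[:, 0])  # gamma_t = beta_inv + delta_t
        Y = np.sum(X * gamma, axis=1) + 0.5 * rng.normal(size=n)
        return X, Y

    # Historical data: the residual component oscillates over time.
    t = np.arange(n_hist)
    X_h, Y_h = draw(2.0 * np.sin(2 * np.pi * t / 500))
    # Adaptation data: a short window after a shift in the residual component.
    X_a, Y_a = draw(np.full(m_adapt, 1.5))
    gamma_true = beta_inv + 1.5 * U_res[:, 0]

    # Zero-shot predictor: regress Y on the invariant coordinates of X, pooled over history.
    coef_inv, *_ = np.linalg.lstsq(X_h @ U_inv, Y_h, rcond=None)
    beta_hat = U_inv @ coef_inv

    # Time adaptation: fit the residual coordinates to what beta_hat leaves unexplained.
    resid = Y_a - X_a @ beta_hat
    coef_res, *_ = np.linalg.lstsq(X_a @ U_res, resid, rcond=None)
    gamma_isd = beta_hat + U_res @ coef_res

    # Baseline: OLS on the adaptation window only.
    gamma_ols, *_ = np.linalg.lstsq(X_a, Y_a, rcond=None)

    return {
        "err_beta_inv": float(np.linalg.norm(beta_hat - beta_inv)),
        "err_gamma_isd": float(np.linalg.norm(gamma_isd - gamma_true)),
        "err_gamma_ols": float(np.linalg.norm(gamma_ols - gamma_true)),
    }

if __name__ == "__main__":
    print(isd_sketch())
```

In this toy setup the adaptation step only has to estimate one residual coordinate from the short window, while the window-only OLS baseline has to estimate all three coordinates from the same 20 points.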

Paper Structure (39 sections, 16 theorems, 131 equations, 14 figures, 1 table, 2 algorithms)

Introduction
Invariant subspace decomposition
Invariant and residual subspaces
Identifying invariant and residual subspaces using joint block diagonalization
Invariant component
Residual component and time adaptation
Population ISD algorithm
Analysis of the two ISD tasks: zero-shot generalization and time adaptation
Zero-shot task
Adaptation task
ISD estimator and its finite sample generalization guarantee
Estimating the subspace decomposition
Approximate joint block diagonalization
Estimating the invariant and residual subspaces
Estimating the invariant and residual components
...and 24 more sections

Key Result

Lemma 1

Let $\{\mathcal{S}_j\}_{j=1}^q$ be an orthogonal and $(X_t)_{t\in[n]}$-decorrelating partition. Then it holds for all $t\in[n]$ that $\gamma_{0,t}=\sum_{j=1}^q \Pi_{\mathcal{S}_j}\gamma_{0,t}$ and for all $j\in\{1,\dots, q\}$ that $\Pi_{\mathcal{S}_j}\gamma_{0,t}=\operatorname{Var}(\Pi_{\mathcal{S}_j}X_t)^{\dagger}\operatorname{Cov}(\Pi_{\mathcal{S}_j}X_t, Y_t)$, where $(\cdot)^\dagger$ denotes the Moore-Penrose pseudoinverse.
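The following population-level check illustrates the kind of statement Lemma 1 makes. It rests on assumptions not stated in the lemma text above: a linear model $Y_t = X_t^\top\gamma_{0,t} + \varepsilon_t$ with centered $X_t$ and noise uncorrelated with $X_t$, a covariance of $X_t$ that is positive definite and block diagonal with respect to the partition, and the per-subspace identity $\Pi_{\mathcal{S}_j}\gamma_{0,t}=\operatorname{Var}(\Pi_{\mathcal{S}_j}X_t)^{\dagger}\operatorname{Cov}(\Pi_{\mathcal{S}_j}X_t, Y_t)$. The function name `lemma1_max_error` is hypothetical.

```python
import numpy as np

def lemma1_max_error(seed=0):
    """Return the largest deviation from the assumed Lemma 1 identities at a single time point."""
    rng = np.random.default_rng(seed)
    p = 4
    U, _ = np.linalg.qr(rng.normal(size=(p, p)))    # orthonormal basis of R^p
    blocks = [U[:, :2], U[:, 2:]]                   # bases of S_1 and S_2
    projs = [B @ B.T for B in blocks]               # orthogonal projections onto S_1, S_2

    def random_spd(k):
        A = rng.normal(size=(k, k))
        return A @ A.T + np.eye(k)                  # symmetric positive definite

    # Var(X) is block diagonal in the basis U, so the partition is decorrelating.
    sigma = sum(B @ random_spd(B.shape[1]) @ B.T for B in blocks)
    gamma = rng.normal(size=p)
    cov_xy = sigma @ gamma                          # Cov(X, Y) under the assumed linear model

    errors = [np.max(np.abs(gamma - sum(P @ gamma for P in projs)))]
    for P in projs:
        var_proj = P @ sigma @ P                    # Var(Pi X)
        cov_proj = P @ cov_xy                       # Cov(Pi X, Y)
        # Explicit cutoff so numerically zero singular values are not inverted.
        rhs = np.linalg.pinv(var_proj, rcond=1e-10) @ cov_proj
        errors.append(np.max(np.abs(P @ gamma - rhs)))
    return float(max(errors))

if __name__ == "__main__":
    print(lemma1_max_error())
```

The pseudoinverse matters here because each projected covariance $\operatorname{Var}(\Pi_{\mathcal{S}_j}X_t)$ is rank deficient whenever $\mathcal{S}_j$ is a proper subspace, so an ordinary inverse does not exist.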

Figures (14)

Figure 1: Example of two-dimensional true parameter $\gamma_{0,t}$ varying on a one-dimensional subspace of $\mathbb{R}^2$, and its estimates using ISD (right) compared to rolling window OLS (left). Time is visually encoded using a color map. (Left) True parameter $\gamma_{0,t}$ (hexagons) on $350$ test points and OLS estimates $\hat{\gamma}^{\operatorname{OLS}}_t$ based on rolling windows of size $16$. (Right) Same test data and true parameters $\gamma_{0,t}$, but now we additionally use $1000$ prior time-points as historical data (not shown) to estimate the decomposition of $\mathbb{R}^2$ into the orthogonal subspaces $\mathcal{S}^{\operatorname{inv}}$ and $\mathcal{S}^{\operatorname{res}}$ (dashed lines). Next, we estimate $\beta^{\operatorname{inv}}$ using the historical data and $\hat{\mathcal{S}}^{\operatorname{inv}}$. Then, using the same rolling windows as in the left plot as adaptation data, we estimate $\delta^{\operatorname{res}}_t$ using $\hat{\mathcal{S}}^{\operatorname{res}}$. The ISD estimates are then given by $\hat{\gamma}_t^{\operatorname{ISD}}=\hat{\beta}^{\operatorname{inv}}+\hat{\delta}^{\operatorname{res}}_t$. All details on the generative model are provided in Example 1. The subspaces $\mathcal{S}^{\operatorname{inv}}$ and $\mathcal{S}^{\operatorname{res}}$ do not need to be axis-aligned, so ISD is applicable even in cases where the conditional of $Y_t$ given $X_t$ and all conditionals of $Y_t$ given subsets of $X_t$ vary over time.
Figure 2: For the data-generating model in Section \ref{sec:simulations_time_adaptation}, we plot (left) the average explained variance (distribution over 20 runs) obtained at training time (historical data) and (right) the cumulative explained variance obtained at testing time (adaptation data) (in one of the 20 runs); in this example, the time-varying components in the historical and adaptation data have disjoint support. The example considers $p=10$-dimensional predictors and an invariant component of dimension $7$. As baselines, we use (i) the true time-varying parameter $\gamma_{0,t}$, which maximizes the explained variance at all observed time points $t$, and (ii) the oracle invariant component $\beta^{\operatorname{inv}}$. $6000$ historical observations are used to estimate: (iii) the invariant component $\hat{\beta}^{\operatorname{inv}}$ of the ISD framework, (iv) the OLS solution $\hat{\beta}^{\operatorname{OLS}}$, (v) the maximin effect $\hat{\beta}^{\operatorname{mm}}$. Starting from $t=0$ after the observed history, windows of length $3p$ are used to estimate: (vi) the adaptation parameter $\hat{\delta}^{\operatorname{res}}_t$ for $\hat{\beta}^{\operatorname{inv}}$ to obtain the ISD estimate $\hat{\gamma}^{\operatorname{ISD}}$ and (vii) the rolling window OLS solution $\hat{\gamma}^{\operatorname{OLS}}_t$. At training time on historical data, the ISD invariant component $\hat{\beta}^{\operatorname{inv}}$ is the most conservative, with the lowest average explained variance. After a distribution shift (adaptation data), however, the same component can explain more variance than other methods based on historical data only ($\hat{\beta}^{\operatorname{OLS}}, \hat{\beta}^{\operatorname{mm}}$), and can be tuned to new time points to improve on estimators based on adaptation data only ($\hat{\gamma}^{\operatorname{OLS}}_t$).
Figure 3: Illustration of the historical and adaptation data and the zero-shot and adaptation tasks.
Figure 4: MSE of $\hat{\beta}^{\operatorname{inv}}$ for increasing size of the historical data $n$ (see Section \ref{sec:invariant_exp}). For larger values of $n$, the estimation of the invariant subspace decomposition becomes more precise and leads to smaller errors in the estimated invariant component $\hat{\beta}^{\operatorname{inv}}$.
Figure 5: Normalized explained variance ($R^2$) by $\hat{\beta}^{\operatorname{inv}}$ and comparison with $\beta^{\operatorname{inv}}$, $\hat{\beta}^{\operatorname{mm}}$ and $\hat{\beta}^{\operatorname{OLS}}$: training (historical data, left) and zero-shot generalization (test data, right), for different sizes $n$ of the historical data (see Section \ref{sec:invariant_exp}). The dashed line indicates the population value of the (normalized) explained variance by $\beta^{\operatorname{inv}}$.
...and 9 more figures

Theorems & Definitions (44)

Definition 1: time-invariance
Remark 1: Exchanging explained variance with MSPE
Lemma 1
Lemma 2
Example 1
Theorem 1
Proposition 1
Example 2: continuation of Example 1
Definition 2: Invariant component
Proposition 2: Properties of $\beta^{\operatorname{inv}}$
...and 34 more
