Stochastic Optimization Schemes for Performative Prediction with Nonconvex Loss

Qiang Li; Hoi-To Wai

TL;DR

This work addresses risk minimization in performative prediction, where the data distribution depends on the deployed model and the loss is smooth but potentially nonconvex. It analyzes the greedy SGD-GD scheme using a time-varying Lyapunov function to handle non-gradient dynamics, and defines stationary performative stability (SPS) as a relaxed notion of stability for nonconvex losses. Under two distribution-shift models (Wasserstein-1 sensitivity with Lipschitz losses, and TV sensitivity with bounded losses), the paper proves that SGD-GD converges to a biased SPS solution at rate $\mathcal{O}(1/\sqrt{T})$, with bias scaling as $\mathcal{O}(\varepsilon)$ or $\mathcal{O}(\varepsilon^2)$ depending on gradient variance. An extension to lazy deployment with epoch length $K$ shows reduced bias on the order of $\mathcal{O}((\tilde{L} \varepsilon)^2)$ as $K$ and $T$ grow. Numerical experiments on synthetic and real data corroborate the theory, providing insight into the stability and efficiency of stochastic methods in nonconvex performative settings.
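To make the Wasserstein-1 sensitivity condition concrete, the sketch below checks it numerically on a hypothetical one-dimensional location-family shift, $z = n + \varepsilon\theta$ with shared noise $n$. The shift model, the function names, and all constants are assumptions chosen for illustration and are not taken from the paper. For equal-size one-dimensional samples, the empirical Wasserstein-1 distance is the mean absolute difference of the sorted samples, so in this toy case it should match $\varepsilon |\theta - \theta'|$.

```python
import numpy as np

def w1_empirical(x, y):
    """W1 distance between two 1-D empirical distributions of equal size."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("both samples must have the same size")
    # In 1-D the optimal coupling matches order statistics.
    return float(np.mean(np.abs(x - y)))

def sample_shifted(theta, eps, noise):
    """Hypothetical location-family shift: z = noise + eps * theta."""
    return noise + eps * theta

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)   # shared base noise (illustrative assumption)
eps = 0.5                       # assumed sensitivity level
theta_a, theta_b = 1.0, 3.0     # two deployed models (scalars for simplicity)

dist = w1_empirical(sample_shifted(theta_a, eps, noise),
                    sample_shifted(theta_b, eps, noise))
bound = eps * abs(theta_a - theta_b)
print(dist, bound)
```

Here the distance meets the bound with equality because the two samples differ by a pure translation; for other distribution maps the inequality may be strict.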

Abstract

This paper studies a risk minimization problem with a decision-dependent data distribution. The problem pertains to the performative prediction setting, in which a trained model can affect the outcome estimated by the model. Such dependency creates a feedback loop that influences the stability of optimization algorithms such as stochastic gradient descent (SGD). We present the first study on performative prediction with smooth but possibly non-convex loss. We analyze a greedy deployment scheme with SGD (SGD-GD). Note that in the literature, SGD-GD is often studied with strongly convex loss. We first propose the definition of stationary performative stable (SPS) solutions by relaxing the popular performative stable condition. We then prove that SGD-GD converges to a biased SPS solution in expectation. We consider two sensitivity conditions on the distribution shifts: (i) the sensitivity is characterized by the Wasserstein-1 distance and the loss is Lipschitz w.r.t. data samples, or (ii) the sensitivity is characterized by the total variation (TV) divergence and the loss is bounded. Under both conditions, the bias levels are proportional to the stochastic gradient's variance and the sensitivity level. Our analysis is extended to a lazy deployment scheme where models are deployed once per several SGD updates, and we show that it converges to an SPS solution with reduced bias. Numerical experiments corroborate our theories.
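The two deployment schemes described above can be sketched on a toy scalar problem. Everything in this example is an assumption made for illustration: the distribution map $\mathcal{D}(\theta) = \mathcal{N}(\varepsilon\theta, 1)$, the nonconvex loss $1 - \exp(-(\theta - z)^2/2)$, the constant step size, and the function names. It does not reproduce the paper's experiments or constants. Greedy deployment corresponds to `K=1`; lazy deployment holds the deployed model fixed for `K` updates.

```python
import numpy as np

EPS = 0.3  # assumed sensitivity of the toy distribution map

def sample(theta_deployed, rng, size=None):
    """Draw z ~ D(theta_deployed) = N(EPS * theta_deployed, 1) (toy assumption)."""
    return EPS * theta_deployed + rng.normal(size=size)

def grad_loss(theta, z):
    """Gradient in theta of the nonconvex loss 1 - exp(-(theta - z)**2 / 2)."""
    u = theta - z
    return u * np.exp(-0.5 * u**2)

def sgd_deploy(theta0, steps, gamma, rng, K=1):
    """SGD where the deployed model is refreshed every K updates (K=1 is greedy)."""
    theta = deployed = theta0
    for t in range(steps):
        if t % K == 0:
            deployed = theta           # (re)deploy the current iterate
        z = sample(deployed, rng)      # data reacts to the deployed model
        theta = theta - gamma * grad_loss(theta, z)
    return theta

def sps_measure(theta, rng, n=20000):
    """Monte Carlo estimate of ||grad J(theta; theta)||^2."""
    z = sample(theta, rng, size=n)
    return float(np.mean(grad_loss(theta, z)) ** 2)

rng = np.random.default_rng(0)
theta0 = 2.0
theta_greedy = sgd_deploy(theta0, steps=2000, gamma=0.05, rng=rng, K=1)
theta_lazy = sgd_deploy(theta0, steps=2000, gamma=0.05, rng=rng, K=10)

sps_init = sps_measure(theta0, rng)
sps_greedy = sps_measure(theta_greedy, rng)
sps_lazy = sps_measure(theta_lazy, rng)
print(sps_init, sps_greedy, sps_lazy)
```

The gradient is taken with respect to the first argument of $J(\theta; \theta')$ only, with samples drawn from the distribution induced by the model being evaluated, which is the quantity the SPS measure $\| \nabla J(\theta_t; \theta_t) \|^2$ tracks.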

Paper Structure (16 sections, 6 theorems, 59 equations, 8 figures, 1 table)

Introduction
Stationary Condition for Performative Stability
Main Results
Sufficient Conditions for Convergence of SGD-GD
Convergence of SGD-GD with Non-convex Loss
Extension: Lazy Deployment Scheme with SGD
Numerical Experiments
Conclusions
Proof of Lemma \ref{lem:descent}
Proof of Lemma \ref{lem:J}
Proof of Lemma \ref{lem:J2}
Proof of Theorem \ref{thm1}
Proof of Theorem \ref{thm2}
Additional Numerical Results
Synthetic Data with Linear Model
...and 1 more section

Key Result

Lemma 1

Under Assumptions \ref{assu:lip_grd} and \ref{assu:var}, suppose that the step size satisfies $\sup_{t \geq 1} \gamma_t \leq 1 / ( L (1+\sigma_1^2) )$. Then, for any $t \geq 0$, the sequence of iterates $\{ {\bm \theta}_t \}_{t \geq 0}$ generated by SGD-GD \eqref{algo:sgd1} satisfies

Figures (8)

Table 1: Comparison of Results in Existing Works. 'Sensitivity' indicates the distance metric imposed on ${\cal D}({\bm \theta})$ when the latter is subject to perturbation, given in the form $d( {\cal D}({\bm \theta}), {\cal D}({\bm \theta}') ) \leq \epsilon \| {\bm \theta} - {\bm \theta}' \|$ such that $d(\cdot,\cdot)$ is a distance metric between distributions. '$\theta_{\infty}$' indicates the type of convergent points: 'PS' refers to performative stable solution [cf. \eqref{eq:ps}], 'SPS' refers to Def. \ref{def:sps}. ${}^\dagger$\cite{izzo2021learn} assumed that ${\cal D}({\bm \theta})$ belongs to the location family, i.e., ${\cal D}({\bm \theta}) = {\cal N}(f({\bm \theta}); \sigma^2)$. ${}^\ddagger$\cite{mofakhami2023performative} considered $\ell({\bm \theta}; z) = \tilde{\ell}(f_{{\bm \theta}}(x), y)$ with strongly convex $\tilde{\ell}(\cdot,y)$. The RRM requires solving a non-convex optimization at each recursion. ${}^\star$SGD-Lazy refers to the SGD method with lazy deployment scheme, which fixes the deployed model for $K$ iterations before the next deployment; see §\ref{sec:lazy}.
Figure 1: Synthetic Data. (Left) SPS measure $\| {\nabla} J( {\bm \theta}_t; {\bm \theta}_t) \|^2$ of SGD-GD against iteration no. $t$. (Middle) Loss value $J({\bm \theta}_t; {\bm \theta}_t)$ of SGD-GD against iteration no. $t$. (Right) SPS measure $\| {\nabla} J( {\bm \theta}_t; {\bm \theta}_t) \|^2$ of greedy (SGD-GD) and lazy deployment against the number of samples accessed. We fix $\epsilon_L=2$.
Figure 2: Real Data with Neural Network. Benchmarking with SPS measure $\| {\nabla} J( {\bm \theta}_t; {\bm \theta}_t ) \|^2$. (Left) Against $t$ for SGD-GD with parameters $\epsilon_{\sf NN} \in \{0, 10, 100 \}$. (Middle and right) Against the number of samples with greedy (SGD-GD) and lazy deployment when $\epsilon_{\sf NN} =10$ and $\epsilon_{\sf NN}=10^4$, respectively.
Figure 3: Synthetic Data. (Left) Training accuracy under different sensitivity parameter $\epsilon_L$. (Right) Testing accuracy under different $\epsilon_L$.
Figure 4: Synthetic Data. (Left) Loss $V({\bm \theta})$ against the number of samples accessed. (Middle) Training accuracy under different sensitivity parameter $\epsilon_{L}$. (Right) Testing accuracy under different $\epsilon_{L}$.
...and 3 more figures

Theorems & Definitions (15)

Definition 1
Lemma 1
proof
Lemma 2
Lemma 3
Remark 1
Theorem 1
Corollary 1
Remark 2
Theorem 2
...and 5 more
