Adjusting Pretrained Backbones for Performativity

Berker Demirel, Lingjing Kong, Kun Zhang, Theofanis Karaletsos, Celestine Mendler-Dünner, Francesco Locatello

TL;DR

This work provides a first baseline for addressing performativity in deep learning, by proposing a novel technique to adjust pretrained backbones for performativity in a modular way, achieving better sample efficiency and enabling the reuse of existing deep learning assets.

Abstract

As deep learning models are widely deployed, they influence their environment in various ways. The induced distribution shifts can lead to unexpected performance degradation in deployed models. Existing methods to anticipate performativity typically incorporate information about the deployed model into the feature vector when predicting future outcomes. While this approach enjoys appealing theoretical properties, modifying the input dimension of the prediction task is often not practical. To address this, we propose a novel technique to adjust pretrained backbones for performativity in a modular way, achieving better sample efficiency and enabling the reuse of existing deep learning assets. Focusing on performative label shift, the key idea is to train a shallow adapter module to perform a Bayes-optimal label shift correction to the backbone's logits given a sufficient statistic of the model to be deployed. As such, our framework decouples the construction of input-specific feature embeddings from the mechanism governing performativity. Motivated by dynamic benchmarking as a use case, we evaluate our approach under adversarial sampling for vision and language tasks. We show how it leads to smaller loss along the retraining trajectory and enables us to effectively select among candidate models to anticipate performance degradations. More broadly, our work provides a first baseline for addressing performativity in deep learning.

Paper Structure (32 sections, 1 theorem, 7 equations, 7 figures, 4 tables, 1 algorithm)

This paper contains 32 sections, 1 theorem, 7 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Proposition 3.1

Assume the pretrained model $f$ accurately represents the likelihood of the training data. Then, if performativity only surfaces in the marginal $P(Y)$, and $P(Y|X)$ is unaffected by performativity, there exists a predictor $T$ such that $F$ recovers $f_{\mathrm{perf}}$.
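
The proposition relies on a standard fact about label shift: if only $P(Y)$ changes while $P(Y|X)$ stays fixed, the Bayes-optimal posterior under the new distribution is obtained by adding the log-ratio of the new and old class priors to the original logits and renormalizing. The sketch below illustrates that correction in plain Python. It is not the authors' implementation: the function names, the three-class example, and the assumption that the target prior is known are all hypothetical. In the paper, the adapter module instead predicts the shift from a sufficient statistic of the deployed model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_shift_correct(logits, train_prior, target_prior):
    """Shift each logit by log(target_prior[y] / train_prior[y]).

    Assumes softmax(logits) matches the training-time posterior and that
    only the label marginal P(Y) changes. All prior entries must be > 0.
    """
    return [
        z + math.log(q) - math.log(p)
        for z, p, q in zip(logits, train_prior, target_prior)
    ]

# Hypothetical 3-class example: balanced training prior, skewed target prior.
logits = [2.0, 1.0, 0.5]
train_prior = [1 / 3, 1 / 3, 1 / 3]
target_prior = [0.1, 0.3, 0.6]

adjusted = label_shift_correct(logits, train_prior, target_prior)
print(softmax(logits))    # posterior under the training prior
print(softmax(adjusted))  # posterior under the assumed target prior
```

In this example the correction moves the most likely class from the first to the third, because the assumed target prior strongly favors the third class.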

Figures (7)

  • Figure 1: Setup: In each round, a model is deployed to make predictions $\hat{Y}$ over $P_t$. These predictions give rise to a new distribution $P_{t+1}$. To achieve high accuracy after deployment, we equip existing backbones with an adapter module to build a performativity-aware predictor. The adapter module seeks to predict the next distribution based on the sufficient statistic $S$ for the shift, and adjusts the predictions accordingly. Under performativity, $S$ is a function of the deployed model.
  • Figure 2: Accuracy along retraining trajectory for vision tasks. Each method starts from the same pretrained model, evaluated on the balanced dataset at $t=0$. Starting from $t=1$, we simulate $200$ rounds of deployments with performative shift of varying strength. The Performative-aware Predictor (PaP) performs well even under the high shift scenario, approaching Bayes-optimal update performance as it is trained over rounds. The inset plot zooms in on the performance up to the first checkpoint. As it learns the structure, it typically adapts to the shift within the first $10$ updates.
  • Figure 3: Accuracy along retraining trajectory for language tasks. The Oracle fine-tuning method is more sensitive to shifts in language datasets. Again, $t=0$ refers to the balanced training accuracy. Similar to the vision case, the Performative-aware Predictor (PaP) performs well under different shift scenarios, approaching the Bayes-optimal Oracle distribution performance as it is trained over rounds. The inset plot provides a detailed view of the initial performance, focusing on the model's learning curve within the first $10$ updates.
  • Figure 4: Modularity of the architecture. We conduct a model switching experiment where we replace the backbone within PaP. PaP still outperforms the No Adaptation baseline consistently, even with models it wasn't originally trained with.
  • Figure 5: Anticipating performativity. Average performance gain over No Adaptation in high shift scenarios. Oracle fine-tuning performs significantly worse than No Adaptation, as it does not anticipate the shift. In contrast, PaP achieves consistent gains and performs comparably to the oracle baseline.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Proposition 3.1
  • proof