Towards Sustainable Investment Policies Informed by Opponent Shaping

Juan Agustin Duque; Razvan Ciuca; Ayoub Echchahed; Hugo Larochelle; Aaron Courville

TL;DR

The paper addresses the misalignment between short‑term profits and long‑term climate welfare by modeling investor–company interactions in a climate‑risk MARL environment called InvestESG. It formalizes when InvestESG exhibits an intertemporal social dilemma and applies Advantage Alignment, a scalable opponent‑shaping method, to steer learning toward cooperative equilibria. The authors prove threshold conditions for social dilemmas in a simplified setting and show empirically that Advantage Alignment outperforms standard baselines such as IPPO and MAPPO in the full InvestESG, achieving higher social welfare with lower final mitigation investment. They also argue that Advantage Alignment induces a cooperative bias via GAE dynamics, helping agents coordinate without central mandates, with implications for policy mechanisms that align market incentives with long‑term sustainability.

Abstract

Addressing climate change requires global coordination, yet rational economic actors often prioritize immediate gains over collective welfare, resulting in social dilemmas. InvestESG is a recently proposed multi-agent simulation that captures the dynamic interplay between investors and companies under climate risk. We provide a formal characterization of the conditions under which InvestESG exhibits an intertemporal social dilemma, deriving theoretical thresholds at which individual incentives diverge from collective welfare. Building on this, we apply Advantage Alignment, a scalable opponent shaping algorithm shown to be effective in general-sum games, to influence agent learning in InvestESG. We offer theoretical insights into why Advantage Alignment systematically favors socially beneficial equilibria by biasing learning dynamics toward cooperative outcomes. Our results demonstrate that strategically shaping the learning processes of economic agents can result in better outcomes that could inform policy mechanisms to better align market incentives with long-term sustainability goals.

Paper Structure (35 sections, 6 theorems, 54 equations, 7 figures, 2 tables)

Introduction
Background
Markov Games
Reinforcement Learning
Social Dilemmas
Opponent Shaping
InvestESG
Economic and Environmental Dynamics
A Formal Analysis of InvestESG as a Social Dilemma
Set-up and Notation
When is InvestESG a Social Dilemma?
Applying Opponent Shaping to InvestESG
Comparison with other Baselines
Interpreting Advantage Alignment Policies
On the Effectiveness of Advantage Alignment
...and 20 more sections

Key Result

Lemma 1

For every time-step $t$, company $i$ and scalar $\alpha_i > 0$, we have that the social marginal gradient is strictly greater than the private marginal gradient.

Figures (7)

Figure 1: Comparison of final environment metrics at different values of $\alpha$ (introduced in equation \ref{eq:CLIMATE_RISK}), for 10 seeds of PPO agents. With the default scaling ($\alpha=1$), the differences in market total wealth and final mitigation amount are negligible between agents with different ESG incentives. Status Quo refers to PPO agents trained without incentives. Increasing the scaling factor ($\alpha = 70$) results in higher market total wealth for ESG-conscious investors and a higher price of anarchy. The whiskers indicate a 1-standard deviation confidence interval. Note that the ESG score is a multiplicative parameter: ESG $=10$ represents an immense prosocial reward, so the investors in that scenario effectively do not care at all about their own profits. These experiments were run with the full complexity allowed by InvestESG, without simplifications.
Figure 2: Final mitigation amount of 3 seeds of PPO agents trained in a single-company, single-investor environment for different values of $\alpha$. Here the public and private gradients are the same, and we clearly see a threshold $\alpha\approx 30$ at which the final mitigation amount increases significantly. This threshold empirically marks the change of sign of the gradient. The whiskers indicate a 1-standard deviation confidence interval.
Figure 3: Training curves of PPO agents with different ESG values and Advantage Alignment (AdAlign) with $\alpha = 70$ for different metrics. Advantage Alignment agents achieve the highest market total wealth by being more strategic about their mitigation investments than PPO agents. With significantly lower final mitigation investment, AdAlign agents achieve the same final climate risk and increase their capital returns. The shaded areas indicate a 1-standard deviation confidence interval. We note that $0.48$ is the best achievable climate risk in the environment, as it corresponds to $1-\prod_e (1-P_0^e)$, the risk obtained when the probability of each event $e$ is at its floor $P_0^e$.
Figure 4: (a) Final market total wealth of 10 seeds of Advantage Alignment without ESG incentives, and PPO agents trained using summed rewards in InvestESG ($\alpha=70$). On the x-axis we increase the number of companies and investors (1 company and 1 investor, 2 companies and 2 investors, etc.) while keeping the initial capital the same. Sum rewards is unable to find the action profile that maximizes social welfare once the number of players grows beyond a threshold ($>2$), whereas Advantage Alignment consistently finds the same solution once the number of agents is large enough ($>1$) going up to $10$ agents. (b) Gini coefficient measuring inequality among company investments for different algorithms. Lower Gini indicates lower inequality and better distribution of resources. The shaded areas indicate a 1-standard deviation confidence interval.
Figure 5: Comparison of final environment metrics at different values of $\alpha$ (introduced in equation \ref{eq:CLIMATE_RISK}), for 10 seeds of PPO agents. We try values of $\{1, 50, 70, 100\}$ for the parameter $\alpha$. The choice of $70$ is the most sensible one, as there are clear differences between all policies with different ESG incentives. The result, albeit similar, is less apparent with a choice of $\alpha=100$.
...and 2 more figures
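
Two quantities named in the captions above can be sketched numerically: the climate-risk floor $1-\prod_e (1-P_0^e)$ from Figure 3 and the Gini coefficient from Figure 4(b). The following is a minimal sketch, not the authors' code. The function names `climate_risk_floor` and `gini` are hypothetical, the floor probabilities in the usage note are illustrative values rather than InvestESG's actual parameters, and the Gini formula is the standard closed form over sorted values, which may differ from the exact estimator used in the paper.

```python
from math import prod


def climate_risk_floor(event_floors):
    """Probability that at least one event occurs when every event
    sits at its floor probability: 1 - prod_e (1 - P_0^e)."""
    return 1.0 - prod(1.0 - p for p in event_floors)


def gini(values):
    """Gini coefficient of non-negative values.

    Returns 0 for a perfectly equal distribution and (n - 1) / n when a
    single entry holds everything. Uses the standard closed form over
    sorted values: G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # convention: no inequality when there is nothing to share
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)
```

With illustrative floors `[0.1, 0.2, 0.3]`, `climate_risk_floor` returns approximately 0.496. For company investments, `gini([1, 1, 1, 1])` is 0 (equal investment), while `gini([0, 0, 0, 10])` is 0.75 (one company carries all the mitigation), matching the caption's reading that a lower Gini means lower inequality.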

Theorems & Definitions (17)

Definition 1: Nash Equilibrium (Nash, 1950)
Definition 2: Social Dilemma
Definition 3: Price of Anarchy in Markov Games
Definition 4: Private Marginal Gradient
Definition 5: Social Marginal Gradient
Lemma 1
proof
Lemma 2
proof
Theorem 1: Social-mitigation
...and 7 more
