ProgressGym: Alignment with a Millennium of Moral Progress

Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang

TL;DR

This work introduces ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions, and introduces lifelong and extrapolative algorithms as baseline methods of progress alignment.

Abstract

Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.

Paper Structure

This paper contains 73 sections, 1 theorem, 10 equations, 9 figures, 7 tables, 7 algorithms.

Key Result

Theorem 1

Within the context of extrapolative RLHF/DPO, let $\omega_{(n-M)..n}$ be the most recent $M+1$ snapshots of observations (i.e., human preference annotation datasets), ${\tilde{\omega}}_{n+1..n+K}$ be the $M$-th order extrapolated observations, and $\mathcal{F}_{\omega}(\theta)$ be the DPO loss function [...] where the right-hand side is $f(n+K)$, with $f(\cdot)$ being the unique $M$-th order polynomial satisfying [...]
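
The polynomial extrapolation named in the theorem can be illustrated with a minimal sketch. The theorem statement is truncated here, so the sketch assumes that $f$ interpolates one scalar value per snapshot at the $M+1$ most recent time steps; the function names `extrapolation_weights` and `extrapolate` are hypothetical and are not part of the ProgressGym code base. It computes the Lagrange weights that map the $M+1$ snapshot values to $f(n+K)$.

```python
from fractions import Fraction


def extrapolation_weights(m, k):
    """Weights w_0..w_m such that f(n+k) = sum_i w_i * f(n-m+i) for the
    unique polynomial f of degree at most m through m+1 consecutive points.

    Illustrative helper (hypothetical name), based on Lagrange interpolation.
    """
    # Shift the time axis so the snapshots sit at x = 0..m; the target is x = m + k.
    x_target = Fraction(m + k)
    weights = []
    for i in range(m + 1):
        w = Fraction(1)
        for j in range(m + 1):
            if j != i:
                w *= (x_target - j) / Fraction(i - j)
        weights.append(w)
    return weights


def extrapolate(values, k):
    """Extrapolate k steps past the last of the given consecutive values."""
    weights = extrapolation_weights(len(values) - 1, k)
    return sum(w * Fraction(v) for w, v in zip(weights, values))


# First-order case (M = 1, K = 1): f(n+1) = 2 f(n) - f(n-1).
print(extrapolation_weights(1, 1))
# Second-order case on the values 1, 4, 9, which lie on (x + 1)^2.
print(extrapolate([1, 4, 9], 1))
```

Because the weights depend only on $M$ and $K$, the same linear combination can be applied to any per-snapshot quantity, which is the sense in which an extrapolation over time steps reduces to a weighted sum over the most recent snapshots.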

Figures (9)

  • Figure 1: Structure of the ProgressGym framework. ProgressGym is (I) the first AI alignment experimental framework with a temporal dimension, (II) the first comprehensive AI alignment framework covering datasets, models, algorithms, and benchmarks, and (III) the first large-scale dataset and model collection in AI alignment, with 38GB of text data covering 9 centuries and 18 historical LLMs at up to 70B parameters.
  • Figure 2: (a) Progress alignment as a temporal POMDP. (b) Technical approaches to progress alignment. Solid boxes represent elements allowed by ProgressGym, while dashed boxes represent those not yet covered; see Appendix A for detailed discussions. In addition to the data-driven methods presented here, another promising route is the reasoning-driven approaches that utilize AI systems to assist moral philosophy thinking; see Appendix A.5 for detailed discussions.
  • Figure 3: Temporal trends in 5 value dimensions from the 13th to the 21st century, and the volume of different data sources for each century.
  • Figure 4: Dimensions of the morality evaluation framework. The meanings of the dimensions are also listed. Generally, the basic morality and social morality sections study how the model makes choices between moral rules when given a moral dilemma. Values in each dimension represent the likelihood that the model will choose to satisfy one rule over the others. Values measure how much the model considers certain perspectives when making choices. Views assess the model's worldview inclinations with respect to the four types of views.
  • Figure 5: UML diagram of the ProgressGym code interface. Only the key members of key classes are presented.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Definition E.1: Marginal Action Likelihood
  • Definition E.2: Marginal Inclination Likelihood
  • Definition E.3: Representation Vector
  • Theorem 1: Extrapolative Algorithms as Polynomial Extrapolation on Loss/Reward Function
  • proof