Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation

Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen

TL;DR

This work introduces MATRIX, a Monopolylogue-based social scene simulator that enables self-alignment of large language models: the unaligned LLM plays every role in a simulated multi-party interaction around the user's query and derives consequence-aware critiques from it. The pipeline has two stages: self-generation of socially aware responses with MATRIX, followed by supervised fine-tuning on those responses to produce MATRIX-tuned LLMs that maintain inference speed. A theoretical analysis shows that an LLM with MATRIX outperforms Constitutional AI under mild assumptions, and experiments on 4 benchmarks show gains over more than 10 baselines. In a human evaluation with 875 user ratings, a 13B MATRIX-tuned model is rated above GPT-4 in alignment with human values. The results suggest that embedding social-consequence reasoning into a self-contained LLM loop improves alignment without external supervision, pointing to a scalable path toward value-aligned AI systems.

Abstract

Aligning large language models (LLMs) with human values is imperative to mitigate potential adverse effects resulting from their misuse. Drawing from the sociological insight that acknowledging all parties' concerns is a key factor in shaping human values, this paper proposes a novel direction to align LLMs by themselves: social scene simulation. To achieve this, we present MATRIX, a novel social scene simulator that emulates realistic scenes around a user's input query, enabling the LLM to take social consequences into account before responding. MATRIX serves as a virtual rehearsal space, akin to a Monopolylogue, where the LLM performs diverse roles related to the query and practices by itself. To inject this alignment, we fine-tune the LLM with MATRIX-simulated data, ensuring adherence to human values without compromising inference speed. We theoretically show that the LLM with MATRIX outperforms Constitutional AI under mild assumptions. Finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. As evidenced by 875 user ratings, our tuned 13B-size LLM exceeds GPT-4 in aligning with human values. See our project page at https://shuotang123.github.io/MATRIX.
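
The data-generation stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the `simulate_consequences` argument (a stand-in for MATRIX), the prompt wording, and the draft-then-revise step are all assumptions made for the example. The subsequent fine-tuning stage is not shown.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a function that maps a prompt to generated text.
LLM = Callable[[str], str]


def build_sft_dataset(
    llm: LLM,
    simulate_consequences: Callable[[str, str], str],
    instructions: List[str],
) -> List[Dict[str, str]]:
    """Collect consequence-aware responses as instruction-response pairs."""
    dataset: List[Dict[str, str]] = []
    for instruction in instructions:
        # Draft answer from the unaligned model.
        draft = llm(f"Instruction: {instruction}\nResponse:")
        # Stand-in for MATRIX: a textual summary of the social
        # consequences of answering with `draft` (assumed interface).
        consequences = simulate_consequences(instruction, draft)
        # Assumed revision step: the same model rewrites its draft
        # in light of the simulated consequences.
        revised = llm(
            f"Instruction: {instruction}\n"
            f"Draft response: {draft}\n"
            f"Simulated consequences: {consequences}\n"
            "Revise the draft so that it avoids the harmful consequences:"
        )
        # Only the instruction and the final response are stored.
        dataset.append({"instruction": instruction, "response": revised})
    return dataset
```

Because each stored pair omits the simulation itself, a model fine-tuned on this dataset would answer in a single pass, which is consistent with the abstract's claim that inference speed is not compromised.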

Paper Structure

This paper contains 34 sections, 4 theorems, 26 equations, 9 figures, and 18 tables.

Key Result

Theorem 4.5

Let $\xi_{\rm CR}$ be the maximum effectiveness critique-upper-bound of the critique used in $\mathbf{T}_{\mathcal{M}}^{\rm CR}$. Let $\mathbf{T}_{\mathcal{M}}^{\rm M}$ satisfy Assumption assumption-matrix with a valid $\lambda$ in stab-critique-generate. When $\sqrt{\xi_{\rm CR}}<1-\sqrt{1 - e^{-\l

Figures (9)

  • Figure 1: Overview of our self-alignment system. In the training stage, the unaligned LLM, enhanced by MATRIX, generates consequence-aware responses to instructions. These instruction-responses form the dataset for the supervised fine-tuning of the LLM, leading to its alignment with human values.
  • Figure 2: MATRIX takes an instruction-response pair as input and outputs the social consequences behind an instruction. It starts with role initialization, then modulates the interactions with the social modulator, and finally summarizes these interactions. In this Monopolylogue simulation, every role is driven by the same LLM and produces behavior descriptions that reflect that role's own interests and concerns.
  • Figure 3: Examples of the prompts used in MATRIX, including role-playing of agents and two functions of the social modulator; see the appendix for more prompts.
  • Figure 4: Human evaluation shows MATRIX-tuned LLMs (13B and 30B) outperform GPT-4 on PKU-SafeRLHF.
  • Figure 5: Illustration of the critique process of two baselines and ours.
  • ...and 4 more figures
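
The three steps named in the Figure 2 caption (role initialization, modulated interaction, summary) can be sketched as a single loop in which one LLM plays every part. This is a hedged illustration only: the function name `matrix_simulate`, the prompt wording, the round-robin turn order, and the `max_steps` cap are hypothetical, and the modulator's actual functions are not specified in this excerpt, so the yes/no plausibility check below is a placeholder.

```python
from typing import Callable, List

# Hypothetical interface: a function that maps a prompt to generated text.
LLM = Callable[[str], str]


def matrix_simulate(llm: LLM, instruction: str, response: str,
                    max_steps: int = 4) -> str:
    """Return a summary of simulated social consequences (sketch)."""
    # 1. Role initialization: ask the LLM which parties are affected.
    roles_text = llm(
        "List the roles affected by this exchange, one per line.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    roles = [line.strip() for line in roles_text.splitlines() if line.strip()]
    if not roles:
        roles = ["user"]  # fallback so the loop always has a role to play

    # 2. Interaction: each role, played by the same LLM, proposes an action;
    #    a placeholder modulator (also the same LLM) admits or rejects it.
    events: List[str] = []
    for step in range(max_steps):
        role = roles[step % len(roles)]  # assumed round-robin turn order
        history = "\n".join(events) or "(nothing yet)"
        action = llm(
            f"You are {role}. Events so far:\n{history}\n"
            "Describe your next action:"
        )
        verdict = llm(
            "As a neutral modulator, is this action plausible given the "
            f"events so far? Answer yes or no.\nAction: {action}"
        )
        if verdict.strip().lower().startswith("yes"):
            events.append(f"{role}: {action}")

    # 3. Summary: condense the admitted events into social consequences.
    return llm(
        "Summarize the social consequences of these events:\n"
        + "\n".join(events)
    )
```

Any prompt-to-text function can be passed as `llm`, so the control flow can be exercised with a scripted stub before a real model is attached.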

Theorems & Definitions (14)

  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Theorem 4.5
  • Definition 1.1
  • Definition 1.2
  • Definition 1.3
  • Lemma 1.4
  • Lemma 1.5
  • Lemma 1.6
  • ...and 4 more