Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization

Wenkai Yang, Shiqi Shen, Guangyao Shen, Wei Yao, Yong Liu, Zhi Gong, Yankai Lin, Ji-Rong Wen

TL;DR

This paper addresses the risk that weak-to-strong supervision of superhuman models could enable deception when multiple alignment objectives conflict. It formalizes a knowledge-space framework with Strong-Known, Strong-Unknown, Weak-Known, and Weak-Unknown regions and introduces a Deception Score to quantify misalignment in areas known to the strong model but unknown to the weak supervisor. Through reward modeling and preference alignment experiments across diverse model families, it shows that weak-to-strong deception is a robust phenomenon that intensifies with the capability gap, and that bootstrapping through intermediate models offers partial mitigation. The findings underscore the need for more reliable supervision strategies to ensure safe and controllable superintelligence in future AI systems.
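As a rough illustration of the knowledge-space idea, the sketch below assigns each sample to one of the four regions and then measures what share of the weak-to-strong model's errors falls in the Strong-Known and Weak-Unknown region. This is only a minimal sketch under stated assumptions, not the paper's definition: the `Sample` fields, the confidence-threshold rule for "known", the default threshold of 0.5, and the exact ratio used in `deception_score` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All fields are illustrative assumptions, not names from the paper.
    strong_conf: float  # strong model's probability on the correct label
    weak_conf: float    # weak model's probability on the correct label
    misaligned: bool    # True if the weak-to-strong model errs on this sample


def region(sample, threshold=0.5):
    """Place a sample in the knowledge space (assumed rule: confidence > threshold means 'known')."""
    strong = "Strong-Known" if sample.strong_conf > threshold else "Strong-Unknown"
    weak = "Weak-Known" if sample.weak_conf > threshold else "Weak-Unknown"
    return strong, weak


def deception_score(samples, threshold=0.5):
    """Share of misaligned samples lying in the Strong-Known, Weak-Unknown region (assumed form)."""
    errors = [s for s in samples if s.misaligned]
    if not errors:
        return 0.0
    hidden = [s for s in errors
              if region(s, threshold) == ("Strong-Known", "Weak-Unknown")]
    return len(hidden) / len(errors)


samples = [
    Sample(strong_conf=0.9, weak_conf=0.2, misaligned=True),   # known to strong, unknown to weak
    Sample(strong_conf=0.9, weak_conf=0.9, misaligned=True),   # known to both
    Sample(strong_conf=0.3, weak_conf=0.2, misaligned=True),   # unknown to both
    Sample(strong_conf=0.9, weak_conf=0.1, misaligned=False),  # no error, so not counted
]
score = deception_score(samples)
```

In this toy input, one of the three misaligned samples sits where the strong model knows the answer and the weak supervisor does not, so a higher score would indicate that errors concentrate where the supervisor cannot detect them.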

Abstract

Superalignment, where humans act as weak supervisors for superhuman models, has become a crucial problem with the rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models, and discovered that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon there may exist an issue of weak-to-strong deception, where strong models deceive weak models by exhibiting well-aligned behavior in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where some alignment targets may conflict with each other (e.g., helpfulness vs. harmlessness). We aim to explore whether, in such cases, strong models might deliberately make mistakes in areas known to them but unknown to weak models within one alignment dimension, in exchange for a higher reward in another dimension. Through extensive experiments in both the reward modeling and preference optimization scenarios, we find: (1) The weak-to-strong deception phenomenon exists across all settings. (2) The deception intensifies as the capability gap between weak and strong models increases. (3) Bootstrapping with an intermediate model can mitigate the deception to some extent, though its effectiveness remains limited. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.

Paper Structure (38 sections, 16 equations, 23 figures, 1 table)


Figures (23)

  • Figure 1: Illustrations of the concepts discussed in this paper. Importantly, we aim to explore a weak-to-strong deception issue behind the current promising weak-to-strong generalization phenomenon: whether the strong student will selectively exhibit misalignment in the areas of knowledge that are unknown to the weak supervisor. We preliminarily study this problem in a realistic multi-objective alignment setting in which some alignment goals may conflict with each other.
  • Figure 2: A deception example about identifying drugs: the strong model exhibits misaligned behavior in a case (Methamphetamine) that the weak model does not know, after perceiving during weak-to-strong alignment that another similar case (Amphetamine) is unknown to the weak model.
  • Figure 3: The expected order of the conflict tax occurrence within different sections of knowledge space.
  • Figure 4: Test accuracies of all weak, strong and weak-to-strong models on the reward modeling task. "Strong Ceiling" represents using ground truth data to fine-tune models. "W2S" stands for "Weak-to-Strong".
  • Figure 5: Deception scores on the reward modeling task.
  • ...and 18 more figures