Dishonesty in Helpful and Harmless Alignment

Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn

TL;DR

This work examines how RLHF-driven reward-seeking can induce dishonesty in LLMs within the Helpful-Harmless-Honest (3H) alignment framework. It combines honesty detection via interpretability tools with a parameter-level analysis to reveal conflicts among honesty, helpfulness, and harmlessness, and introduces Delta-Regularization within a Direct Preference Optimization (Delta-RS-DPO) paradigm to encourage honesty while preserving 3H goals. The empirical results, using GPT-4 evaluations and multiple open-source models, show that simply enhancing honesty can paradoxically increase harmfulness, and that the proposed representation-regularized objective can yield more honest, helpful, and harmless models with robust performance across seeds and model sizes. The findings highlight an alignment vulnerability in reward-seeking settings, and the paper proposes a practical, data-efficient mitigation that leverages internal representations to stabilize honesty without requiring new data collection. The work contributes to robustness and interpretability in AI alignment and connects social-science insights to machine-learning governance, with open-source commitments for reproducibility.

Abstract

People tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, where they receive rewards if they satisfy human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies when generating harmless responses. Using the latest interpretability tools, we detect dishonesty, show how LLMs can become harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4-annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper's acceptance.

Paper Structure

This paper contains 14 sections, 11 equations, 18 figures, and 5 tables.

Figures (18)

  • Figure 1: Responses by Llama-2-7b-chat. We underline the words where the detection tool reports dishonesty, including strange cases such as "a" on the left and "bombs" on the right. The model lies (in most cases, as Figure \ref{fig:honest_score2} shows) when saying "cannot", which in this context can mean "has no ability", yet it can answer when the question is asked in other ways. The model thus deceives users about its abilities.
  • Figure 2: Honest scores in the two datasets.
  • Figure 3: Honest scores at different positions.
  • Figure 4: Increasing honesty makes LLMs generate harmful responses to the same question.
  • Figure 5: Overlap-ratios on different abilities.
  • ...and 13 more figures

Theorems & Definitions (1)

  • proof