Fake Alignment: Are LLMs Really Aligned Well?

Yixu Wang; Yan Teng; Kexin Huang; Chengqi Lyu; Songyang Zhang; Wenwei Zhang; Xingjun Ma; Yu-Gang Jiang; Yu Qiao; Yingchun Wang

TL;DR

The paper defines fake alignment as the gap between an LLM's safety performance on open-ended questions and on the corresponding multiple-choice questions: a model can appear well aligned in one format while failing in the other. It introduces the FINE framework with two metrics, Consistency Score (CS) and Consistent Safety Score (CSS), to quantify this discrepancy and correct performance estimates for it. Applying the framework to 14 LLMs reveals notable alignment gaps. To mitigate fake alignment, the authors propose contrast distillation-based supervised fine-tuning, which yields strong gains in CSS (often above 80%) with modest computational overhead. The work highlights the need for cross-format safety evaluation and training, and provides practical methods for obtaining more credible alignment assessments and improvements.

Abstract

The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in safety evaluation. This study investigates an under-explored issue in the evaluation of LLMs: the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue that this is caused by mismatched generalization. That is, an LLM only remembers the answer style for open-ended safety questions, which leaves it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. We introduce a Fake alIgNment Evaluation (FINE) framework and two novel metrics, Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. Applying FINE to 14 widely used LLMs reveals that several models with purported safety are poorly aligned in practice. We also find that multiple-choice format data can be used as high-quality contrast distillation-based fine-tuning data, which can strongly improve the alignment consistency of LLMs with minimal fine-tuning overhead. For data and code, see https://github.com/AIFlames/Fake-Alignment.
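
The abstract names CS and CSS but does not give their formulas. The sketch below illustrates one plausible reading, and it is an assumption rather than the authors' implementation: each test item receives a binary safe/unsafe judgment in the open-ended format and in the multiple-choice format, CS is the fraction of items where the two judgments agree, and CSS is the fraction of items judged safe in both formats. The function names and the sample data are hypothetical.

```python
from typing import Sequence


def _validate(open_safe: Sequence[bool], mc_safe: Sequence[bool]) -> None:
    """Require two non-empty, equally long lists of per-item judgments."""
    if len(open_safe) != len(mc_safe):
        raise ValueError("both formats must cover the same test items")
    if len(open_safe) == 0:
        raise ValueError("at least one test item is required")


def consistency_score(open_safe: Sequence[bool], mc_safe: Sequence[bool]) -> float:
    """Assumed CS: share of items where both formats give the same judgment."""
    _validate(open_safe, mc_safe)
    agree = sum(bool(o) == bool(m) for o, m in zip(open_safe, mc_safe))
    return agree / len(open_safe)


def consistent_safety_score(open_safe: Sequence[bool], mc_safe: Sequence[bool]) -> float:
    """Assumed CSS: share of items judged safe in BOTH formats."""
    _validate(open_safe, mc_safe)
    both_safe = sum(bool(o) and bool(m) for o, m in zip(open_safe, mc_safe))
    return both_safe / len(open_safe)


if __name__ == "__main__":
    # Hypothetical judgments for four test items (True = safe response).
    open_ended = [True, True, False, True]
    multiple_choice = [True, False, False, True]
    print("CS :", consistency_score(open_ended, multiple_choice))
    print("CSS:", consistent_safety_score(open_ended, multiple_choice))
```

In this toy example the model looks 75% safe on open-ended questions alone, but only half of the items are answered safely in both formats, which is the kind of gap the paper calls fake alignment.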

Paper Structure (23 sections, 5 equations, 8 figures, 9 tables)

Introduction
Background and Notions
Fake Alignment
The Fake Alignment Phenomenon
Test Data Construction
Empirical Results
Fake Alignment Evaluation Framework
Evaluation Pipeline
Consistency Measurement
Experiment Results
Mitigating the Fake Alignment
Contrast Distillation-based Supervised Fine-tuning
Experiment Results
Conclusion
Appendices
...and 8 more sections

Figures (8)

Figure 1: Performance comparison of common LLMs on safety-related open-ended question test sets (left) and multiple-choice test sets (right). The dashed line represents the average performance; LLMs' safety performance is clearly poorer on multiple-choice questions. (CAP: Chinese-Alpaca-Plus)
Figure 2: An example from the dataset we designed. Each test question contains an open-ended question (above) and its corresponding multiple-choice question (below). LLMs often perform well in answering open-ended questions but struggle to select safe options correctly.
Figure 3: Details of our proposed Fake alIgNment Evaluation (FINE) framework.
Figure 4: The results of CS and CSS.
Figure 5: The CSS results of fine-tuned LLMs.
...and 3 more figures
