Are Aligned Large Language Models Still Misaligned?

Usman Naseem; Gautam Siddharth Kashyap; Rafiq Ali; Ebad Shabbir; Sushant Kumar Ray; Abdullah Mohammad; Agrima Seth

TL;DR

This work tackles the problem of LLM misalignment across safety, value, and culture by introducing Mis-Align Bench and SaVaCu, a unified English dataset of $M=382{,}424$ prompts spanning $112$ domains. The authors implement a two-stage pipeline: Stage I constructs SaVaCu by mapping prompts to a unified taxonomy (14 safety, 56 value, 42 cultural domains), expanding sparse domains with conditional generation and deduplicating via SimHash, then pairing aligned and misaligned responses through rejection sampling; Stage II benchmarks general-purpose, fine-tuned, and open-weight LLMs under jointly constrained conditions using three metrics: Coverage, False Failure Rate, and Alignment Score. Key findings show that models optimized for a single dimension achieve high Coverage (up to $97.6\%$) but incur high False Failure Rates ($>50\%$) and modest Alignment Scores (63–66%), while general-purpose aligned models attain higher joint Alignment Scores (approximately 81%) by balancing detection and false positives. Dimension-specific tuning improves single-dimension performance but harms robustness under cross-domain constraints, whereas open-weight LLMs offer stability but lower Coverage. Overall, Mis-Align Bench provides a scalable, automated framework to diagnose and understand complex, real-world misalignment arising from the interaction of safety, value, and culture in LLMs, guiding more robust alignment strategies.

Abstract

Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, which prevents simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplication. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under the three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate above 50% and a lower Alignment Score (63%-66%) under joint conditions.
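
The SimHash-based fingerprinting mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the word 3-gram features, the 64-bit fingerprint, the MD5 feature hash, and the `max_distance=3` threshold are all assumptions chosen for illustration, and the function names `simhash`, `hamming`, and `deduplicate` are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from word 3-gram features (assumed feature choice)."""
    tokens = re.findall(r"\w+", text.lower())
    # Word 3-grams; texts with fewer than 3 tokens yield a single shorter shingle.
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(prompts, max_distance: int = 3):
    """Keep a prompt only if its fingerprint is far from every kept fingerprint."""
    kept, fingerprints = [], []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept
```

Near-duplicate prompts share most of their features, so their fingerprints differ in only a few bits, while unrelated prompts differ in roughly half of them. The linear scan over kept fingerprints is quadratic overall; at the scale of several hundred thousand prompts an indexed lookup over fingerprint bands would be the usual substitute.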

Paper Structure (19 sections, 10 figures, 6 tables)

This paper contains 19 sections, 10 figures, 6 tables.

Introduction
Related Works
General-Purpose Misalignment
Dimension-Specific Misalignment
Methodology
Overview of the Pipeline.
Stage I: SaVaCu
Module I (Query Construction).
Module II (Response Generation).
Stage II: Benchmarking
Evaluation Metrics
Experimental Results and Analysis
Benchmark Analysis
Cross-Domain Analysis.
Validation of Classification.
...and 4 more sections

Figures (10)

Figure 1: Illustration of misalignment under joint dimension conditions. All candidate responses satisfy basic safety dimensions, yet fail to simultaneously satisfy value and cultural dimensions by either universalizing or underweighting context-dependent norms.
Figure 2: Taxonomies used in Mis-Align Bench. Safety: 14 safety domains (from BeaverTails); Value: 56 value domains (from ValueCompass); and Cultural: 42 cultural domains (from UNESCO).
Figure 3: Overview of the Mis-Align Bench pipeline. Stage I constructs SaVaCu from unified safety, value, and cultural prompts that are paired with aligned-misaligned responses generated with rejection sampling. Stage II benchmarks aligned, dimension-specific, and open-weight LLMs under jointly constrained conditions.
Figure 4: Multi-domain classification in Module I (Query Construction) using unified taxonomies.
Figure 5: Prompt template used in Module I (Query Construction) when a domain contains $< 100$ prompts.
...and 5 more figures
