Are Aligned Large Language Models Still Misaligned?
Usman Naseem, Gautam Siddharth Kashyap, Rafiq Ali, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Agrima Seth
TL;DR
This work tackles the problem of LLM misalignment across safety, value, and culture by introducing Mis-Align Bench and SaVaCu, a unified English dataset of 382,424 prompts spanning 112 domains. The authors implement a two-stage pipeline. Stage I constructs SaVaCu by mapping prompts to a unified taxonomy (14 safety, 56 value, 42 cultural domains), expanding sparse domains with conditional generation, deduplicating via SimHash, and then pairing aligned and misaligned responses through rejection sampling. Stage II benchmarks general-purpose, fine-tuned, and open-weight LLMs under jointly constrained conditions using three metrics: Coverage, False Failure Rate, and Alignment Score. Key findings show that models optimized for a single dimension achieve high Coverage (up to 97.6%) but incur high False Failure Rates (>50%) and modest Alignment Scores (63–66%), while general-purpose aligned models attain higher joint Alignment Scores (approximately 81%) by balancing detection and false positives. Dimension-specific tuning improves single-dimension performance but harms robustness under cross-domain constraints, whereas open-weight LLMs offer stability but lower Coverage. Overall, Mis-Align Bench provides a scalable, automated framework to diagnose and understand complex, real-world misalignment arising from the interaction of safety, value, and culture in LLMs, guiding more robust alignment strategies.
Abstract
Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy the safety, value, and cultural dimensions that co-occur in real-world queries. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, which prevents simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET against a taxonomy of 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and by expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplication. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment across the three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate >50% and a lower Alignment Score (63%-66%) under joint conditions.
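As a rough illustration of the SimHash-based fingerprinting mentioned above, the following is a minimal sketch of near-duplicate filtering over generated prompts. The abstract does not specify the shingle size, the underlying hash function, or the Hamming-distance threshold, so the word 3-gram shingles, the MD5-derived 64-bit hash, and `max_distance=3` used here are all assumptions chosen for illustration, as are the function names.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (assumption: word 3-gram shingles)."""
    tokens = re.findall(r"\w+", text.lower())
    # Short inputs collapse to a single shingle containing all tokens.
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(prompts: list[str], max_distance: int = 3) -> list[str]:
    """Keep a prompt only if no earlier kept prompt is within `max_distance` bits.

    The threshold is a hypothetical value, not one taken from the paper.
    This linear scan is O(n^2); a large corpus would need bucketed lookup.
    """
    kept: list[str] = []
    fingerprints: list[int] = []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept
```

Because tokenization lowercases the text and drops punctuation, prompts that differ only in casing or punctuation map to the same fingerprint and are filtered as duplicates, while unrelated prompts typically differ in many bits and are retained.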
