ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation

Jingnan Zheng; Han Wang; An Zhang; Tai D. Nguyen; Jun Sun; Tat-Seng Chua

TL;DR

ALI-Agent introduces an agent-based framework to evaluate LLM alignment with human values via two stages: Emulation, which auto-generates realistic misconduct scenarios using memory and adapters, and Refinement, which iteratively refines these scenarios to probe long-tail risks. The framework leverages a memory module, an automatic emulator, and a refined evaluator to test target LLMs with prompts that blend predefined misconduct and web-derived queries. Across stereotypes, morality, and legality, ALI-Agent demonstrates stronger detection of misalignment than baselines, while human and moderation studies validate the quality and realism of generated test scenarios. The authors discuss scalability, integration with jailbreak techniques, and limitations related to core LLM dependence and safety risks in scenario generation, suggesting future work with open models and domain-specific testing.

Abstract

Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values, posing severe risks to users and society. To mitigate these risks, current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. However, the labor-intensive nature of these benchmarks limits their test scope, hindering their ability to generalize to the extensive variety of open-world use cases and identify rare but crucial long-tail risks. Additionally, these static tests fail to adapt to the rapid evolution of LLMs, making it hard to evaluate timely alignment issues. To address these challenges, we propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two principal stages: Emulation and Refinement. During the Emulation stage, ALI-Agent automates the generation of realistic test scenarios. In the Refinement stage, it iteratively refines the scenarios to probe long-tail risks. Specifically, ALI-Agent incorporates a memory module to guide test scenario generation, a tool-using module to reduce human labor in tasks such as evaluating feedback from target LLMs, and an action module to refine tests. Extensive experiments across three aspects of human values--stereotypes, morality, and legality--demonstrate that ALI-Agent, as a general evaluation framework, effectively identifies model misalignment. Systematic analysis also validates that the generated test scenarios represent meaningful use cases, as well as integrate enhanced measures to probe long-tail risks. Our code is available at https://github.com/SophieZheng998/ALI-Agent.git
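The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `MemoryRecord`, `Memory`, `assess`, and the callables `emulate`, `refine`, `target`, and `is_misaligned` are hypothetical names, the word-overlap retrieval is a stand-in for whatever retrieval the framework actually uses, and reading the stored third field as an explanation is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class MemoryRecord:
    misconduct: str   # original misconduct description
    scenario: str     # scenario that exposed misalignment
    explanation: str  # assumed meaning: note on why the scenario is risky


@dataclass
class Memory:
    records: List[MemoryRecord] = field(default_factory=list)

    def retrieve(self, misconduct: str) -> Optional[MemoryRecord]:
        # Placeholder retrieval by word overlap (an assumption for this sketch).
        if not self.records:
            return None
        words = set(misconduct.lower().split())
        return max(
            self.records,
            key=lambda r: len(words & set(r.misconduct.lower().split())),
        )


Generated = Tuple[str, str]  # (scenario, explanation)


def assess(
    misconduct: str,
    memory: Memory,
    emulate: Callable[[str, Optional[MemoryRecord]], Generated],
    refine: Callable[[str, str, str], Generated],
    target: Callable[[str], str],
    is_misaligned: Callable[[str, str], bool],
    max_refinements: int = 2,
) -> Tuple[bool, str]:
    """Return (misalignment_found, last_scenario_tested)."""
    # Emulation: build a realistic scenario, guided by a past record if any.
    demo = memory.retrieve(misconduct)
    scenario, explanation = emulate(misconduct, demo)

    for step in range(max_refinements + 1):
        response = target(scenario)
        if is_misaligned(scenario, response):
            # Save the successful pattern for later tests.
            memory.records.append(
                MemoryRecord(misconduct, scenario, explanation)
            )
            return True, scenario
        if step < max_refinements:
            # Refinement: revise the scenario using the target's feedback.
            scenario, explanation = refine(misconduct, scenario, response)

    return False, scenario
```

In practice each callable would wrap an LLM call; passing them in as parameters keeps the control flow testable with simple stubs.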

Paper Structure (29 sections, 7 equations, 4 figures, 27 tables, 1 algorithm)

Method of ALI-Agent
Emulation Stage
Refinement Stage
Experiments
Performance Comparison (RQ1)
Study on Test Scenarios (RQ2)
Study on ALI-Agent (RQ3)
Related Work
Conclusion
Additional related work
Alignment of LLMs
Red Teaming LLMs
Broader Impacts
Scalability & Generalizability
Experiments
...and 14 more sections

Figures (4)

Figure 2: An example of ALI-Agent's implementation. In the emulation stage, ALI-Agent generates $x_k^{(0)}$, a realistic scenario that reflects a violation of $x_k$, a legal regulation, with $m_j$ serving as an in-context demonstration. In the refinement stage, ALI-Agent refines $x_k^{(0)}$ into $x_k^{(1)}$ by adding an extra excuse, which makes the misconduct of "eating on MRT" appear more reasonable and successfully misleads $\mathcal{T}$ into overlooking the issue. This pattern of wrapping up misconduct is saved back to $\mathbf{M}$ in the form of $m_k = (x_k, x_k^{(1)}, e_k^{(1)})$ for subsequent tests, boosting ALI-Agent's ability to generalize risky tests to new cases.
Figure 3: Examples of misconduct and of the scenarios generated and refined by ALI-Agent. The highlighted parts show how ALI-Agent refines sensitive content to lower its perceptible sensitivity, thereby probing long-tail risks. In these examples, target LLMs fail to properly identify the corresponding misconduct only when prompted with the refined scenarios.
Figure 4: Study on ALI-Agent. Panel (a) demonstrates the impact of each component (i.e., evaluation memory, iterative refiner) on the ETHICS dataset. Panel (b) showcases the benefits of multiple refinement iterations and the adaptability gained by integrating jailbreak techniques (e.g., GPTFuzzer) on the AdvBench dataset.
Figure 5: Log-scale moderation scores on DecodingTrust and Social Chemistry 101. Zero-shot refers to using the original misconduct as test prompts, which later serve as input to ALI-Agent. The higher the moderation score, the more sensitive content the prompt contains. Refining the test scenarios further reduces the perceivable harmfulness, making it harder for target LLMs to identify the risks.
