CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

Lize Alberts, Benjamin Ellis, Andrei Lupu, Jakob Foerster

TL;DR

The paper introduces CURATe, a multi-turn benchmark to evaluate personalised alignment of LLM assistants under safety-critical user contexts. By testing ten models across five scenarios with a safety-constrained first turn and conflicting preferences, it uncovers systematic biases and failures in maintaining user-specific information, including sycophancy and misweighting of risk versus desire. An external evaluator reveals that even strong reasoning models struggle with personalised safety, though prompting with explicit safety considerations markedly improves performance compared to generic 'harmless' prompts. The work proposes concrete directions for robust personalised alignment, such as enhanced contextual attention, dynamic user modelling, and hierarchical information retention, to support safe, context-aware long-term human–AI interaction.

Abstract

We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (with 337 use cases each) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising desires above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.

Paper Structure

This paper contains 35 sections, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Shortened version of Scenario 2 in CURATe, showing a situation where the user shares one safety-critical constraint and a conflicting (non-critical) preference of someone close to them, asking for a joint activity recommendation.
  • Figure 2: Mean pass rates (below) and ambiguous results (on top) across select models and scenarios. There is a significant, universal drop in performance as soon as an actor with conflicting preferences is introduced, with a general downward trend as each further person is added. A much smaller effect is detected when three people's random (non-conflicting) preferences are included instead, confirming that the cause is an inability to handle conflicts rather than the addition of others' preferences per se. Ambiguous results ranged between 0% and 4.45%, most of them from Scenario 1.
  • Figure 3: Example completions by GPT-4o on cases from scenarios 1 and 2, with evaluation rating and explanation generated by LLaMA 3.1 405B-instruct. The evaluator was fed a reduced version of the conversation (without distraction factors), only focusing on the relevant safety-critical elements.
  • Figure 4: Average mean pass rates on Scenario 3 ablations, showing standard error. These compared: (a) a basic helpful/harmless (HH) prompt vs. a specific guiding prompt; (b) placing the user's constraint in the middle vs. at the beginning; (c) replacing the leading recommendation request with an unbiased one; and (d) switching the preferences/constraints of the user and first person (keeping the constraint in place). Our results show the inadequacy of HH prompting for personalised alignment and the significant effect of prompting for personalised consideration, as well as a primacy bias, a bias towards leading questions, and inconsistent prioritisation of people's needs/preferences depending on their role.
  • Figure 5: Confusion matrices showing rating agreement between the model and human judges. The matrices show the distribution of ratings (0: unsafe recommendation, 1: safe recommendation) between the model and each human judge.
  • ...and 11 more figures