Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov

TL;DR

The paper argues that effective human-facing AI requires just-in-time personalized reasoning, which integrates problem solving with inference of user preferences. It formalizes personalized reasoning as three steps: discovering context-relevant attributes, eliciting the user's values for them, and adapting the reasoning under a joint objective that balances correctness and alignment. PrefDisco is introduced as an end-to-end benchmark pipeline that creates psychologically-grounded personas, context-dependent preferences, attribute-specific rubrics, and realistic user simulations to diagnose preference discovery and adaptation. Across 21 frontier models and 10 tasks, the evaluation reveals substantial gaps: 29.0% of naive personalization attempts produce worse preference alignment than generic responses, poor question quality and too few questions hinder performance, and task accuracy can drop under personalization. Brittleness varies by domain, suggesting that dedicated personalization research is needed for education, healthcare, and technical domains.

Abstract

Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user's needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly -- a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
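The three steps named in the abstract (identify unknown preferences, elicit them through questions, adapt the response) can be pictured as a small interaction loop. The following is a minimal illustrative sketch under stated assumptions, not PREFDISCO's implementation: the names `SimulatedUser`, `elicit_preferences`, `adapt_response`, and `max_questions`, as well as the attribute strings, are all hypothetical, and a dictionary lookup stands in for the persona-based user simulation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SimulatedUser:
    """Hypothetical stand-in for a persona with sparse preferences."""
    # Only a few candidate attributes matter to this user for this task.
    preferences: Dict[str, str] = field(default_factory=dict)

    def answer(self, attribute: str) -> Optional[str]:
        # Return the preference value, or None if the user has no preference.
        return self.preferences.get(attribute)


def elicit_preferences(
    user: SimulatedUser, candidate_attributes: List[str], max_questions: int
) -> Tuple[Dict[str, str], int]:
    """Ask about candidate attributes in order until the question budget runs out."""
    discovered: Dict[str, str] = {}
    asked = 0
    for attribute in candidate_attributes:
        if asked >= max_questions:
            break
        asked += 1
        value = user.answer(attribute)
        if value is not None:
            discovered[attribute] = value
    return discovered, asked


def adapt_response(base_answer: str, discovered: Dict[str, str]) -> str:
    """Toy adaptation: annotate a fixed, correct answer with discovered preferences."""
    if not discovered:
        return base_answer  # cold start with nothing learned: generic response
    notes = ", ".join(f"{k}={v}" for k, v in sorted(discovered.items()))
    return f"{base_answer} [adapted for: {notes}]"


# Assumed example persona: two relevant attributes out of four candidates.
user = SimulatedUser({"expertise": "novice", "tone": "reassuring"})
candidates = ["format", "expertise", "length", "tone"]
discovered, asked = elicit_preferences(user, candidates, max_questions=3)
print(adapt_response("The answer is 42.", discovered))
```

With a budget of three questions the loop never reaches `tone`, so only `expertise` is discovered. This is a toy illustration of how a limited or poorly ordered set of questions can leave relevant preferences undiscovered.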

Paper Structure

This paper contains 45 sections, 6 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Personalized reasoning in a medical scenario. Current LLMs provide generic responses without considering the user (left); a model with personalized reasoning capabilities incorporates discovered preferences to provide responses that are both correct and aligned to the user (right).
  • Figure 2: PrefDisco benchmark construction pipeline. The framework transforms static benchmarks by sampling sparse, context-dependent preference subsets for each user-task pair, generating attribute-specific evaluation rubrics, and implementing realistic user simulation that requires models to discover preferences through "just-in-time" strategic questioning in cold-start scenarios.
  • Figure 3: Positive correlation (r=0.445) between question volume and preference alignment. Better personalization requires more extensive questioning. Regression coefficients: Claude=0.117, OpenAI=0.379, Gemini=0.474.
  • Figure 4: More personalization constraints in context hinder model reasoning abilities. Overall accuracy: Baseline=0.652, Oracle=0.618, Discovery=0.601. Trade-off is most pronounced in Math, AIME, and logic tasks.
  • Figure 5: Fixed interaction length hinders preference alignment on math and science tasks but improves preference alignment on social reasoning.
  • ...and 1 more figure
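
The overall accuracies quoted in the Figure 4 caption can be converted into absolute and relative drops with a few lines of arithmetic. The sketch below uses only the three numbers from that caption; the function and variable names are illustrative.

```python
# Overall accuracies as quoted in the Figure 4 caption.
accuracy = {"Baseline": 0.652, "Oracle": 0.618, "Discovery": 0.601}


def accuracy_drop(condition, reference="Baseline"):
    """Return (absolute, relative) accuracy drop of a condition versus the reference."""
    ref = accuracy[reference]
    absolute = ref - accuracy[condition]
    return absolute, absolute / ref


for name in ("Oracle", "Discovery"):
    absolute, relative = accuracy_drop(name)
    print(f"{name}: -{absolute:.3f} absolute, -{relative:.1%} relative")
```

Relative to the baseline, this gives a drop of 0.034 (about 5.2%) for the Oracle condition and 0.051 (about 7.8%) for the Discovery condition.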