Cold-Start Personalization via Training-Free Priors from Structured World Models

Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du, Yulia Tsvetkov, Maryam Fazel, Lin Xiao, Asli Celikyilmaz

TL;DR

Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about.

Abstract

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.
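The offline/online split described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a hypothetical discrete latent-class belief model (a `prior` over user types and a table `theta[k][c]` giving the probability that a user of type `k` cares about criterion `c`, both imagined as learned offline from complete profiles), binary answers, and expected information gain as the question-selection rule. All function and variable names are illustrative, and every entry of `theta` is assumed to lie strictly between 0 and 1.

```python
import math

def posterior(prior, theta, answers):
    """Bayes update over latent user types given {criterion: 0 or 1} answers."""
    weights = []
    for k, p_k in enumerate(prior):
        w = p_k
        for c, a in answers.items():
            w *= theta[k][c] if a == 1 else 1.0 - theta[k][c]
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def predict_profile(belief, theta):
    """Posterior-weighted probability that the user cares about each criterion,
    including criteria that were never asked about."""
    n_criteria = len(theta[0])
    return [sum(b * theta[k][c] for k, b in enumerate(belief))
            for c in range(n_criteria)]

def next_question(prior, theta, answers):
    """Pick the unasked criterion with the largest expected information gain
    about the latent user type."""
    belief = posterior(prior, theta, answers)
    marginal = predict_profile(belief, theta)
    best, best_gain = None, -1.0
    for c in range(len(theta[0])):
        if c in answers:
            continue
        expected_h = 0.0
        for a, p_a in ((1, marginal[c]), (0, 1.0 - marginal[c])):
            if p_a > 0:
                updated = posterior(prior, theta, {**answers, c: a})
                expected_h += p_a * entropy(updated)
        gain = entropy(belief) - expected_h
        if gain > best_gain:
            best, best_gain = c, gain
    return best

# Hypothetical "offline" model: two user types, three criteria.
prior = [0.5, 0.5]
theta = [[0.9, 0.8, 0.5],   # type 0 tends to care about criteria 0 and 1
         [0.1, 0.2, 0.5]]   # type 1 tends not to; criterion 2 is uninformative

first = next_question(prior, theta, {})       # selects criterion 0
belief = posterior(prior, theta, {0: 1})      # user says they care about 0
profile = predict_profile(belief, theta)      # estimate for unasked criterion 1 rises
```

In this toy model, one answer about criterion 0 shifts the estimate for the unasked criterion 1 away from the population average, which is the kind of correlation-driven prediction the abstract describes. No training happens online; each step is a closed-form belief update.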

Paper Structure (59 sections, 1 theorem, 24 equations, 4 figures, 4 tables, 1 algorithm)

Key Result

Proposition 1

Learning an effective questioning policy via RL requires exploring the space of question sequences, which grows combinatorially in $|\mathcal{C}(x)|$ (the number of criteria). With only terminal feedback, sample complexity scales exponentially in the budget $T$. In contrast, learning preference correlations
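
As a rough illustration of the combinatorial growth mentioned in the proposition's first sentence, consider counting fixed (non-adaptive) question sequences that ask $T$ distinct criteria out of $n = |\mathcal{C}(x)|$. The values $n = 20$ and $T = 5$ below are illustrative assumptions, not figures taken from the paper:

$$
\frac{n!}{(n-T)!} \;=\; 20 \cdot 19 \cdot 18 \cdot 17 \cdot 16 \;=\; 1{,}860{,}480 .
$$

Adaptive policies, whose next question depends on earlier answers, span an even larger space, so this count is only a lower bound on what a policy search with terminal-only feedback would have to explore.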

Figures (4)

  • Figure 1: Overview of Pep. Offline: We learn a structured world model from population data capturing preference correlations through latent user embeddings. Online: For new users, we adaptively select questions, update beliefs after each response, and predict the full preference profile, including preferences not asked about, to generate personalized responses.
  • Figure 2: Query efficiency comparison between Pep and GRPO. Points above the dashed parity line indicate Pep requires fewer queries than GRPO to achieve equivalent preference alignment.
  • Figure 3: Adaptivity versus preference alignment across methods and datasets. Higher adaptivity is associated with better alignment.
  • Figure 4: Ablation study on (i) modeling preference correlations: without the latent structure that captures correlations between preferences (red), elicitation yields very slow improvement over population average, compared to modeling preference correlations (blue, gray); and (ii) adaptive querying: adaptive selection (blue) closes 24% of the gap at $T{=}5$, outperforming random querying (gray) at $T{=}6$, effectively saving one question.

Theorems & Definitions (1)

  • Proposition 1: Abbreviated