Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Nishant Balepur; Malachi Hamada; Varsha Kishore; Sergey Feldman; Amanpreet Singh; Pao Siangliulue; Joseph Chee Chang; Eunsol Choi; Jordan Lee Boyd-Graber; Aakanksha Naik

Abstract

Deep Research (DR) tools (e.g., OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but they lack an understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR that users value, so we interview users of an online version of MySQA to unmask those aspects. We reveal nine nuanced errors of personalized DR that are undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

Paper Structure (41 sections, 14 figures, 9 tables)

When Deep Research Gets to Know You
MySQA: Personalized Deep Research
Inferring Researcher Profiles
Proposing Actions to Take
Synthesizing a Personalized Report
A Formative Study with MySQA
Offline Evaluation
Dataset Collection
Metric Implementation
Baselines
Offline Results
Moving MySQA Offline to Online
Interviewing Active Deep Research Users
RQ1: What our Offline Evaluation Missed
LLMs Don't Know What DR Users Want
...and 26 more sections

Figures (14)

Figure 1: Overview of MySQA: a three-step personalized Deep Research system. (1) Researchers pick papers that represent them, from which an LLM infers a profile of their interests. (2) When the researcher asks a query, MySQA proposes a list of actions that can alter the report, tailored to the profile. (3) The system answers the query and executes said actions in a report via a multi-LLM pipeline. Users can edit/toggle profiles and actions and view highlights in the report where MySQA personalizes.
Figure 2: An editable profile inference created by MySQA.
Figure 3: Editable, tailored actions generated by MySQA.
Figure 4: Example of a report in MySQA with highlights, helping users find relevant content for each action they select.
Figure 5: User satisfaction with MySQA profiles, actions, and reports. Users are perfectly satisfied with $73\%$ of them.
...and 9 more figures
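
The three-step flow described in the Figure 1 caption (infer a profile, propose actions, write a report that follows the user-approved actions) can be sketched as a small data flow. This is only an illustrative sketch, not the authors' implementation: the names `Profile`, `Action`, `run_pipeline`, and the `demo_*` stand-ins are hypothetical, and the LLM calls are replaced by plain functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Profile:
    """Hypothetical container for an inferred, user-editable research profile."""
    interests: List[str]


@dataclass
class Action:
    """Hypothetical personalized action; a user may toggle it off."""
    description: str
    enabled: bool = True


def run_pipeline(
    papers: List[str],
    query: str,
    infer_profile: Callable[[List[str]], Profile],
    propose_actions: Callable[[str, Profile], List[Action]],
    write_report: Callable[[str, List[Action]], str],
    review_actions: Callable[[List[Action]], List[Action]] = lambda actions: actions,
) -> str:
    """Chain the three steps; only actions left enabled after review reach the report."""
    profile = infer_profile(papers)                 # step 1: profile from chosen papers
    proposed = propose_actions(query, profile)      # step 2: actions tailored to the profile
    approved = [a for a in review_actions(proposed) if a.enabled]
    return write_report(query, approved)            # step 3: report following approved actions


# Stand-ins for the LLM-backed steps (assumptions made purely for illustration).
def demo_infer_profile(papers: List[str]) -> Profile:
    return Profile(interests=list(papers))


def demo_propose_actions(query: str, profile: Profile) -> List[Action]:
    return [Action(f"Relate findings to: {interest}") for interest in profile.interests]


def demo_write_report(query: str, actions: List[Action]) -> str:
    lines = [f"Query: {query}"] + [f"- {a.description}" for a in actions]
    return "\n".join(lines)


def disable_first(actions: List[Action]) -> List[Action]:
    """Simulates a user toggling off the first proposed action."""
    actions[0].enabled = False
    return actions
```

Passing `disable_first` as `review_actions` mimics a user rejecting one proposed action before the report is written. Keeping the review step as a separate hook reflects the caption's point that users can edit or toggle profiles and actions.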
