LiteraryTaste: A Preference Dataset for Creative Writing Personalization

John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yi Wang, Yuqian Sun, Tiffany Wang, Shm Garanganao Almeda, Brett A. Halperin, Yuwen Lu, Max Kreminski

TL;DR

LiteraryTaste introduces a real-user dataset for creative writing personalization, pairing 60 annotators' stated reading preferences with revealed preferences over 100 short-text pairs. The authors systematically evaluate modeling approaches, finding that fine-tuning a transformer encoder (ModernBERT-large) achieves the best personal-preference accuracy ($0.758$) with $90$ training samples, and remains competitive with as few as $15$ samples, highlighting sample efficiency. Aggregated (group) preferences are harder to predict than individual preferences, though LLM prompting can sometimes outperform certain baselines in zero-shot settings. Stated preferences provide limited, sometimes helpful signals, but integrating them with revealed preferences yields inconsistent gains; the work also offers a detailed qualitative analysis of preference dimensions and a practical guide for eliciting personal preferences in creative-writing tools.

Abstract

People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user's preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people's preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.

Paper Structure

This paper contains 62 sections, 5 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: We present LiteraryTaste, a dataset for creative writing personalization. 60 annotators created the dataset, where each provided 100 binary preference annotations (revealed preference) and 34 survey responses, including those about reading habits and tastes (stated preference). Using the dataset, we addressed a series of research questions related to creative writing personalization.
  • Figure 2: Demographics of data collection participants.
  • Figure 3: Training approaches in RQ2, 3, and 4. Red, blue, and green indicate tuned weights, frozen models, and model output, respectively. a) For Full-Finetuning-based approaches, we finetuned all weights of the transformer encoder. b) For Logistic Regression, Decision Tree, and Neural Network-based approaches, we first embedded texts with frozen embedding models and then trained the corresponding models with the embeddings as training inputs. Approaches in a) and b) could be trained for aggregated preferences (Agg-, in RQ3) and as Cross-annotator models (i.e., taking stated preference input to infer the preference from the perspective of annotators who would have such a stated preference, in RQ4). c) Cross-LR-Weight (in RQ4) trains a neural network model that infers the weights of a logistic regression model given stated preference input. Note that, as embedding models, we used jinaai/jina-embeddings-v4 (Günther et al., 2025) and ModernBERT-large finetuned on the style similarity dataset (Sterman et al., 2020).
  • Figure 4: RQ2 results on personal preference modeling. All indicates training models on the concatenation of the semantic and style embeddings of texts, while Sem and Sty indicate only using semantic or style embeddings, respectively. Rand and Sim indicate sampling few shots either randomly or based on sample similarities, respectively. RSOff means turning off reasoning capability, while Synth uses SynthesizeMe! (Ryan et al., 2025) to infer user profiles. Note that o4-mini and Sonnet-4 approaches do not have training accuracy as they are prompting-based. Error bars and ranges in this paper indicate 95% confidence intervals.
  • Figure 5: RQ2 results with varying training set sizes.
  • ...and 8 more figures
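
The Figure 3b setup (frozen embeddings feeding a small per-annotator classifier) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the embeddings are random vectors standing in for frozen-model outputs, the hidden `taste` vector and the helper names (`make_pairs`, `train_logreg`, `accuracy`) are hypothetical, the pair feature is a plain concatenation of the two text embeddings, and the 90/10 split simply mirrors the 90 training samples out of 100 pairs mentioned in the summary.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def make_pairs(n, dim, taste, rng):
    """Synthetic stand-in for one annotator's data (NOT the LiteraryTaste data).

    Each text is a random vector playing the role of a frozen embedding;
    the label is 1 when the hidden `taste` vector scores text A above text B.
    """
    X, y = [], []
    for _ in range(n):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        b = [rng.gauss(0, 1) for _ in range(dim)]
        X.append(a + b)  # pair feature: concatenated embeddings (assumed design)
        y.append(1 if dot(taste, a) > dot(taste, b) else 0)
    return X, y


def train_logreg(X, y, lr=0.05, epochs=200):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = sigmoid(dot(w, x) + b) - t  # log-loss gradient w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(w, b, X, y):
    """Fraction of pairs where the predicted winner matches the label."""
    hits = sum((dot(w, x) + b >= 0) == (t == 1) for x, t in zip(X, y))
    return hits / len(y)


rng = random.Random(0)
taste = [rng.gauss(0, 1) for _ in range(8)]
X_train, y_train = make_pairs(90, 8, taste, rng)  # 90 training pairs
X_test, y_test = make_pairs(10, 8, taste, rng)    # assumed 10-pair hold-out
w, b = train_logreg(X_train, y_train)
print("train accuracy:", accuracy(w, b, X_train, y_train))
print("held-out accuracy:", accuracy(w, b, X_test, y_test))
```

Because the synthetic labels are an exact linear function of the concatenated embeddings, this toy problem is much easier than real preference data, so the printed accuracies say nothing about the accuracies reported for LiteraryTaste. The sketch only shows the shape of the pipeline: embed both texts, build a pair feature, and fit one lightweight model per annotator.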