IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
Jieyong Kim, Maryam Amirizaniani, Soojin Yoon, Dongha Lee
TL;DR
IPQA defines core intents as the motivations users prioritize when asking questions in personalized question answering (PQA) and provides a benchmark for evaluating core-intent identification. The dataset is built from cQA-based narratives and answers across 47 domains, with LLM-based intent annotation, rigorous quality control, and an IPQA-Eval framework that aligns model predictions with ground-truth core intents. Experiments show that state-of-the-art language models struggle to infer core intents from user histories, and that performance deteriorates as questions become more complex and involve multiple intents. The work establishes core-intent identification as a foundational challenge in PQA and offers a practical framework, along with resources for future research, for advancing personalization-aware intent understanding.
Abstract
Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because, without understanding which intents users prioritize, systems cannot generate responses that satisfy individual information needs. To address this, we introduce the concept of core intents: the intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory, under which users choose answers that meet their acceptance thresholds. We construct a dataset spanning various domains through systematic filtering, LLM-based annotation, and rigorous quality control that combines automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. In particular, models fail to identify core intents from user histories, and performance degrades as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.
