Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering
Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Zhuowan Li, Spurthi Amba Hombaiah, Weize Kong, Tao Chen, Hamed Zamani, Michael Bendersky
TL;DR
Pathways of Thoughts (PoT) addresses personalized question answering over long, noisy user contexts by framing the model's thinking as an iterative Markov Decision Process. It thinks in multiple directions by exploring diverse planning trajectories, then aggregates their outputs with Mixture-of-N so that the final response aligns with inferred user preferences, all without task-specific fine-tuning. On LaMP-QA, PoT achieves up to a 10.8% relative improvement over baselines, and human annotators prefer it in 66% of cases against the best-performing baseline. The gains generalize across backbones such as Gemini 1.5 Pro and GPT-4o-mini. PoT is thus a scalable, inference-time approach to personalized long-form QA that uses diverse reasoning paths to improve factual alignment and user satisfaction.
Abstract
Personalization is well studied in search and recommendation, but personalized question answering remains underexplored because it is hard to infer preferences from long, noisy, implicit contexts and to generate responses that are both accurate and aligned with user expectations. To address this, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without task-specific fine-tuning. PoT models thinking as an iterative decision process in which the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark show that PoT consistently outperforms competitive baselines, achieving up to a 10.8% relative improvement. Human evaluation further validates these improvements: annotators prefer PoT over the best-performing baseline in 66% of cases and report ties in 15% of cases.
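The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`run_pathway`, `pathways_of_thoughts`, `echo_llm`), the prompt formats, the `finalize` stop action, and the random action policy are all assumptions made for illustration. In the method as described, the model itself selects the next operation, and the aggregation step reweights candidates by inferred user preferences; here both are reduced to placeholders around a generic `llm` callable.

```python
import random
from typing import Callable, List

# Cognitive operations named in the abstract.
# "finalize" is an assumed stop action, not taken from the source.
ACTIONS = ["reasoning", "revision", "personalization", "clarification", "finalize"]

LLM = Callable[[str], str]  # any function mapping a prompt to a completion


def run_pathway(question: str, profile: str, llm: LLM,
                rng: random.Random, max_steps: int = 4) -> str:
    """Roll out one thinking trajectory and return its candidate answer."""
    thoughts: List[str] = []
    for _ in range(max_steps):
        # Placeholder policy (assumption): sample an operation at random.
        # The abstract describes the model choosing the operation itself.
        action = rng.choice(ACTIONS)
        if action == "finalize":
            break
        prompt = f"[{action}] question={question} profile={profile} thoughts={thoughts}"
        thoughts.append(llm(prompt))
    # Produce a candidate response conditioned on the accumulated thoughts.
    return llm(f"[answer] question={question} thoughts={thoughts}")


def pathways_of_thoughts(question: str, profile: str, llm: LLM,
                         n: int = 3, seed: int = 0) -> str:
    """Sample n trajectories, then merge their candidates into one response."""
    candidates = [
        run_pathway(question, profile, llm, random.Random(seed + i))
        for i in range(n)
    ]
    joined = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    # Assumed aggregation: a single prompt asking the model to combine the
    # candidates in light of the user profile.
    return llm(f"[aggregate] question={question} profile={profile} candidates:\n{joined}")


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: returns the prompt's tag and length."""
    tag = prompt.split("]")[0][1:]
    return f"<{tag}:{len(prompt)}>"


if __name__ == "__main__":
    print(pathways_of_thoughts("How do I start running?", "prefers short answers", echo_llm))
```

To try this with a real model, replace `echo_llm` with a function that sends the prompt to an LLM API and returns the completion. The number of trajectories `n` trades inference cost against candidate diversity.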
