Investigating LLM Variability in Personalized Conversational Information Retrieval

Simon Lupart, Daniël van Dijk, Eric Langezaal, Ian van Dort, Mohammad Aliannejadi

TL;DR

The paper investigates LLM variability in personalized Conversational Information Retrieval by reproducing and extending Mo et al.'s work on the TREC iKAT datasets. Through multi-run experiments, open-model comparisons, and evaluation on iKAT 2023 and 2024, it shows human-selected PTKB information consistently improves retrieval, while LLM-based selection is not reliably superior, and highlights higher variability on iKAT than on CAsT. It also demonstrates that recall-oriented metrics are more stable across runs, underlining the importance of multi-run reporting. The study generalizes to iKAT 2024 and finds that larger open-source LLMs can approach GPT-level performance with few-shot prompting, while emphasizing robust evaluation practices for personalized CIR.

Abstract

Personalized Conversational Information Retrieval (CIR) has seen rapid progress in recent years, driven by the development of Large Language Models (LLMs). Personalized CIR aims to enhance document retrieval by leveraging user-specific information, such as preferences, knowledge, or constraints, to tailor responses to individual needs. A key resource for this task is the TREC iKAT 2023 dataset, designed to evaluate personalization in CIR pipelines. Building on this resource, Mo et al. explored several strategies for incorporating Personal Textual Knowledge Bases (PTKB) into LLM-based query reformulation. Their findings suggested that personalization from PTKBs could be detrimental and that human annotations were often noisy. However, these conclusions were based on single-run experiments using the GPT-3.5 Turbo model, raising concerns about output variability and repeatability. In this reproducibility study, we rigorously reproduce and extend their work, focusing on LLM output variability and model generalization. We apply the original methods to the new TREC iKAT 2024 dataset and evaluate a diverse range of models, including Llama (1B-70B), Qwen-7B, and GPT-4o-mini. Our results show that human-selected PTKBs consistently enhance retrieval performance, while LLM-based selection methods do not reliably outperform manual choices. We further compare variance across datasets and observe higher variability on iKAT than on CAsT, highlighting the challenges of evaluating personalized CIR. Notably, recall-oriented metrics exhibit lower variance than precision-oriented ones, a critical insight for first-stage retrievers. Finally, we underscore the need for multi-run evaluations and variance reporting when assessing LLM-based CIR systems. By broadening evaluation across models, datasets, and metrics, our study contributes to more robust and generalizable practices for personalized CIR.
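
The multi-run reporting that the abstract argues for can be illustrated with a minimal sketch. It assumes each metric has already been computed once per run of the same pipeline; the function name `summarize_runs`, the metric labels, and every score below are hypothetical placeholders chosen for illustration, not results from the paper. The sketch reports the mean, the sample standard deviation, and the coefficient of variation, so that metrics on different scales can be compared for stability.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_metric):
    """Summarize repeated runs per metric.

    scores_by_metric maps a metric name to a list of scores, one per run.
    Returns {metric: (mean, sample_std, coefficient_of_variation)}.
    """
    summary = {}
    for metric, scores in scores_by_metric.items():
        if len(scores) < 2:
            # A single run cannot say anything about variability.
            raise ValueError(f"{metric}: need at least two runs")
        m = mean(scores)
        s = stdev(scores)  # sample standard deviation (n - 1 denominator)
        cv = s / m if m != 0 else float("nan")
        summary[metric] = (m, s, cv)
    return summary


# Placeholder scores for five hypothetical runs (NOT values from the paper).
runs = {
    "Recall@1000": [0.50, 0.52, 0.51, 0.49, 0.53],
    "nDCG@3": [0.10, 0.14, 0.08, 0.12, 0.16],
}

summary = summarize_runs(runs)
for metric, (m, s, cv) in summary.items():
    print(f"{metric}: mean={m:.3f} std={s:.3f} cv={cv:.3f}")
```

Reporting the spread next to the mean makes it visible when a difference between two methods is smaller than the run-to-run noise of a single method, which a single-run comparison cannot reveal.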

Paper Structure

This paper contains 18 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Performance comparison between CAsT 2020 and iKAT 2023, using rewrites from three LLMs. Personalized CIR is more complex, involving not only disambiguations of anaphora and ellipsis but also personalizing from the PTKB. Results averaged over 10 runs, using human PTKB for iKAT.
  • Figure 2: Outline of the LLM-aided conversational IR pipeline with a separate PTKB selection and reformulation stage. Select Then Retrieve (STR) follows this setup.
  • Figure 3: Outline of the Select And Reformulate (SAR) conversational IR pipeline.
  • Figure 4: Performance of open-source LLMs on iKAT 2023 with a varying number of in-context learning examples (Recall@1000). Retrieval uses SAR and BM25.
  • Figure 5: Agreement between human and automatic method (Recall@1000) for PTKB selection; split according to how many of the five automatic runs chose the same PTKB.