The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A

Satyajit Movidi, Stephen Russell

TL;DR

This study uses AiVisor, a retrieval-augmented agentic LLM pipeline for student advising, to investigate personalization in agentic Q&A, evaluating outputs with lexical, semantic, and grounding metrics under a deliberately lexically stringent test. It finds metric-dependent trade-offs: personalization improves reasoning and grounding but lowers semantic similarity scores when responses are compared against a single generic ground truth, which exposes a methodological limit of standard metrics. A Linear Mixed-Effects Model with multi-metric normalization reveals interactions among role prompting, retrieval conditioning, and personalization stages. The fully integrated personalized configuration (System K) achieves the strongest composite performance by combining reasoning gains with grounding improvements, and the work offers a methodological framework for robust, transparent personalization in agentic AI.

Abstract

AiVisor, an agentic retrieval-augmented LLM for student advising, was used to examine how personalization affects system performance across multiple evaluation dimensions. Using twelve authentic advising questions intentionally designed to stress lexical precision, we compared ten personalized and non-personalized system configurations and analyzed outcomes with a Linear Mixed-Effects Model across lexical (BLEU, ROUGE-L), semantic (METEOR, BERTScore), and grounding (RAGAS) metrics. Results showed a consistent trade-off: personalization reliably improved reasoning quality and grounding, yet introduced a significant negative interaction on semantic similarity, driven not by poorer answers but by the limits of current metrics, which penalize meaningful personalized deviations from generic reference texts. This reveals a structural flaw in prevailing LLM evaluation methods, which are ill-suited for assessing user-specific responses. The fully integrated personalized configuration produced the highest overall gains, suggesting that personalization can enhance system effectiveness when evaluated with appropriate multidimensional metrics. Overall, the study demonstrates that personalization produces metric-dependent shifts rather than uniform improvements and provides a methodological foundation for more transparent and robust personalization in agentic AI.
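
The multi-metric comparison described above depends on putting metrics with different scales on a common footing. The sketch below shows one plausible way to do that: standardize each metric across systems as a z-score, leave unavailable metrics blank, and average what remains into a composite. It is an illustration under stated assumptions, not the authors' implementation. The function names, the use of population standard deviation, the equal-weight mean for the composite, and every number in the example data are hypothetical.

```python
from statistics import mean, pstdev


def zscore_by_metric(scores):
    """Standardize each metric across systems.

    scores maps system -> {metric: value or None}; None marks a metric
    that is unavailable for that system and stays None in the output.
    Assumption: population standard deviation (pstdev) is used.
    """
    metrics = sorted({m for per_system in scores.values() for m in per_system})
    z = {system: {} for system in scores}
    for metric in metrics:
        observed = [
            per_system[metric]
            for per_system in scores.values()
            if per_system.get(metric) is not None
        ]
        mu = mean(observed) if observed else 0.0
        sigma = pstdev(observed) if observed else 0.0
        for system, per_system in scores.items():
            value = per_system.get(metric)
            if value is None:
                z[system][metric] = None      # keep the blank
            elif sigma == 0:
                z[system][metric] = 0.0       # no spread across systems
            else:
                z[system][metric] = (value - mu) / sigma
    return z


def composite(z):
    """Assumption: composite = unweighted mean of available z-scores."""
    result = {}
    for system, per_metric in z.items():
        available = [v for v in per_metric.values() if v is not None]
        result[system] = mean(available) if available else None
    return result


# Hypothetical scores for three configurations; not taken from the paper.
example_scores = {
    "A": {"bleu": 0.10, "bertscore": 0.80, "ragas": None},
    "B": {"bleu": 0.20, "bertscore": 0.85, "ragas": 0.60},
    "K": {"bleu": 0.30, "bertscore": 0.90, "ragas": 0.80},
}
example_z = zscore_by_metric(example_scores)
example_composite = composite(example_z)
```

A positive z-score means a system sits above the cross-system average for that metric, so a configuration can score well on grounding while scoring below average on semantic similarity, which is the kind of metric-dependent shift the abstract reports.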

Paper Structure

This paper contains 21 sections, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Block illustration of the AiVisor system with Personalization Agent, VectorDB, and Prompt Assembly.
  • Figure 2: System-level means illustrating correlation strength across all eight metrics.
  • Figure 3: Standardized (z-score) system performance by metric. Positive values indicate above-average performance for that metric relative to other systems; blanks indicate unavailable metrics.
  • Figure 4: Distribution of BLEU, ROUGE-L, METEOR, and BERTScore across systems. Box plots illustrate within-metric variance and outliers, providing a baseline view of lexical and semantic dispersion prior to normalization and system-level aggregation.
  • Figure 5: Overall system (composite) performance.
  • ...and 4 more figures