
GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan

Abstract

We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed from real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals along with rich textual descriptions. We validate the dataset's fidelity against user surveys and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
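The abstract describes IG as decomposed into precision and recall over interest categories. As a minimal sketch of how such a decomposition might look, assuming IG compares the LLM's predicted category set against an evidence-verified category set (the function name `ig_scores` and the set-based formulation are illustrative assumptions, not the paper's exact definitions):

```python
def ig_scores(predicted: set, verified: set) -> dict:
    """Sketch of set-based IG precision/recall/F1 for one user.

    predicted: interest categories the LLM emitted (mapped to the taxonomy)
    verified:  categories supported by the user's engagement evidence

    This is an assumed formulation; the paper's actual IG definitions may
    weight categories or use a different matching procedure.
    """
    hits = predicted & verified  # categories that are both predicted and verified
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(verified) if verified else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"IG_P": precision, "IG_R": recall, "IG_F1": f1}
```

Under this reading, a model that hallucinates categories loses IG precision, while one that misses evidenced interests loses IG recall, matching the abstract's stated roles for the two components.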

Paper Structure

This paper contains 76 sections, 9 equations, 9 figures, and 20 tables.

Figures (9)

  • Figure 1: Overview of the GISTBench evaluation pipeline. User interaction histories are processed by an LLM to produce predicted interests. These are evaluated on two axes: Interest Groundedness (IG, decomposed into precision and recall) and Interest Specificity (IS, computed over verified categories only), then mapped to standardized taxonomy categories for normalization.
  • Figure 2: Comparison of synthetic and real UIH distributions. (a) Per-user engagement count densities show similar right-skewed patterns. (b) Box plots confirm that the relative ordering across action types is preserved.
  • Figure 3: IG Precision vs. IG Recall per model on the survey dataset. Bubble size encodes the average number of predicted interest categories; color encodes IG$_{F1}$. Dotted curves are iso-F1 contours at 0.1--0.5. All models cluster in the low-recall region (IG$_R < 0.4$), confirming that coverage is the universal bottleneck. DeepSeek-R1 achieves the highest precision (57.3%) and IG$_{F1}$ (42.9%). Qwen2.5-72B offers the best balance between precision and recall among capable models.
  • Figure 4: Per-user IG Precision vs. IG Recall on the benchmark datasets, colored by model. Diamond markers show per-model medians; small dots show individual users ($\alpha=0.15$). The diagonal reference line ($y=x$) highlights the universal asymmetry IG$_P >$ IG$_R$. Within-model variance is substantial: even top models have many users with IG$_P < 0.5$. This reflects user-level difficulty variation.
  • Figure 5: Forest plot showing 95% confidence intervals for mean IG$_{F1}$ scores on the survey dataset. Overlapping intervals indicate that ranking differences between adjacent models may not be statistically significant.
  • ...and 4 more figures
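Figure 3's dotted iso-F1 contours can be derived directly from the F1 definition: fixing F1 = 2PR/(P+R) and solving for R gives R = F1·P / (2P − F1). A small sketch of that contour function (the name `iso_f1_recall` is illustrative; this is just the standard F1 algebra, not code from the paper):

```python
def iso_f1_recall(f1: float, p: float) -> float:
    """Recall needed to reach a target F1 at precision p.

    Derived from F1 = 2*P*R / (P + R); only defined when 2*p > f1,
    i.e. the target F1 is reachable at that precision.
    """
    if 2 * p <= f1:
        raise ValueError("target F1 unreachable at this precision")
    return f1 * p / (2 * p - f1)
```

For example, at a target F1 of 0.4 and precision 0.6, the required recall is 0.24 / 0.8 = 0.3, which is the kind of point the 0.1-0.5 contour family in Figure 3 traces.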