LIDS: LLM Summary Inference Under the Layered Lens

Dylan Park, Yingying Fan, Jinchi Lv

TL;DR

We propose LIDS, a new method of LLM summary inference that combines a BERT-SVD-based direction metric with SOFARI to assess summary accuracy and to provide interpretable key words for layered themes.

Abstract

Large language models (LLMs) have attracted significant attention from researchers and practitioners in natural language processing (NLP) since the introduction of ChatGPT in 2022. One notable feature of ChatGPT is its ability to generate summaries from prompts. Yet evaluating the quality of these summaries remains challenging because of the complexity of language. To this end, we propose a new method of LLM summary inference with a BERT-SVD-based direction metric and SOFARI (LIDS) that assesses summary accuracy and provides interpretable key words for layered themes. LIDS uses a latent SVD-based direction metric to measure the similarity between the summaries and the original text, leveraging BERT embeddings and repeated prompts to quantify the statistical uncertainty. As a result, LIDS gives a natural embedding of each summary for large text reduction. We further exploit SOFARI to uncover important key words associated with each latent theme in the summary with controlled false discovery rate (FDR). Comprehensive empirical studies demonstrate the practical utility and robustness of LIDS through human verification and comparisons to other similarity metrics, including a comparison of different LLMs.

Paper Structure

This paper contains 25 sections, 4 equations, 15 figures, 5 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Rescaled boxplots of the LIDS, BLEU, ROUGE-1, ROUGE-L, METEOR, and BERTScore similarity measures for GPT-5 (dark blue) over $50$ repeated prompts and two benchmark summary mechanisms (light blue and gray) over $50$ random repetitions on the Utah article.
  • Figure 2: Boxplots of the LIDS similarity measures for GPT-5 over $50$ repeated prompts and two benchmark summary mechanisms over $50$ random repetitions on the Utah article.
  • Figure 3: Scatter plot of the average human-evaluated summary quality scores (horizontal axis) against the LIDS similarity scores of the summaries (vertical axis), with a fitted linear regression line; the Pearson correlation between the two is $0.904$. The experiment is described in Section \ref{Subsec.HumanVerif}.
  • Figure 4: LIDS visualization word cloud plots with FDR control at level $q = 0.005$ for the first three latent SVD layers of a representative LLM summary of the Utah article.
  • Figure 5: Comparison of different LLMs with LIDS in terms of the Sharpe ratio-type measure of accuracy per unit of uncertainty, i.e., the mean similarity divided by the corresponding standard deviation over $50$ repeated prompts on the Utah article. Larger values indicate better performance.
  • ...and 10 more figures
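
The Sharpe ratio-type measure in the Figure 5 caption (mean similarity divided by the corresponding standard deviation over repeated prompts) can be illustrated with a minimal sketch. The function name `sharpe_type_ratio`, the model labels, and the similarity scores below are hypothetical and made up for illustration; they are not results from the paper. The sketch also assumes the sample standard deviation, which the caption does not specify.

```python
import statistics


def sharpe_type_ratio(similarities):
    """Mean similarity divided by its standard deviation.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the caption only says "the corresponding standard deviation".
    """
    if len(similarities) < 2:
        raise ValueError("need at least two similarity scores")
    mean = statistics.fmean(similarities)
    std = statistics.stdev(similarities)
    if std == 0:
        raise ValueError("zero spread: the ratio is undefined")
    return mean / std


# Hypothetical similarity scores from repeated prompts of two LLMs
# (the paper uses 50 repetitions; 5 are shown here for brevity).
scores_by_model = {
    "model_a": [0.80, 0.82, 0.78, 0.81, 0.79],  # stable summaries
    "model_b": [0.85, 0.60, 0.95, 0.70, 0.90],  # same mean, larger spread
}

ratios = {name: sharpe_type_ratio(s) for name, s in scores_by_model.items()}
best = max(ratios, key=ratios.get)  # larger values indicate better performance
print(ratios, best)
```

Both hypothetical models have the same mean similarity, so the ratio separates them purely by how consistent their summaries are across repeated prompts, which is the accuracy-per-unit-of-uncertainty reading given in the caption.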