No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA

Robert L Simione

TL;DR

The paper addresses the problem of evaluating domain-specific knowledge in LLMs without constructing QA benchmarks. It introduces response dispersion, a metric computed from the singular-value structure of the embeddings of responses generated with multiple random seeds. Dispersion is defined as the number of singular values needed to explain 95% of the variance in the response-embedding matrix. Two embedding methods are compared (OpenAI's text-embedding-3-large and reference sentence similarity (RSS) embeddings) on a repurposed dataset, IRC-WikiTriviaQA, spanning 11 domains. Empirical results show a consistent inverse relationship between dispersion and QA accuracy (Spearman rank correlation roughly between -0.59 and -0.71 across categories). A use-case analysis indicates that choosing a model by dispersion agrees with choosing by QA performance in about 74% to 89% of cases, depending on the tolerance. The approach offers a scalable, end-user-centric alternative to QA benchmarking, with potential applications in iterative fine-tuning, continual learning, and retrieval-augmented generation; its limitations stem from the choice of dataset and the domain coverage.

Abstract

This research seeks to remove the need to create QA datasets and grade (chatbot) LLM responses when comparing LLMs' knowledge in specific topic domains. The approach is entirely end-user centric: it requires no access to the inner workings of the LLM, only that the model can be prompted and given a random seed to produce different generations for the same prompt. For a given topic domain, the paper defines the "response dispersion" of an LLM by repeatedly asking it the same opinion question about that domain. Specifically, the response dispersion is the number of singular values needed to explain 95% of the variance in the embedding matrix of the LLM's responses. Response dispersion is found to be inversely correlated with accuracy on relevant QA evaluations (average Spearman rank correlation stronger than -0.59). A use-case analysis shows that, when comparing two LLMs on the same topic domain, comparing their response dispersion is a suitable replacement for comparing their QA accuracy between 74% and 89% of the time. The range depends on the accuracy-difference tolerance an end user is willing to accept in exchange for the labor saved by using response dispersion instead of QA accuracy. Two response embeddings are studied for building the embedding matrix: one from OpenAI's APIs, and a novel embedding, here named reference sentence similarity embeddings, that can be computed locally and performs very nearly as well for calculating response dispersion. Finally, a pre-existing dataset called the IRC-Wiki Trivia dataset, originally developed for trivia games, has been repurposed and curated; the curated version, called IRC-WikiTriviaQA, is made available for the purpose of this research.
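The definition of response dispersion above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that each row of the matrix is the embedding of one response, that columns are mean-centered before the decomposition, and that "variance explained" is measured by squared singular values; the abstract does not specify any of these details. The function name `response_dispersion` and its parameters are hypothetical.

```python
import numpy as np


def response_dispersion(embeddings, threshold=0.95, center=True):
    """Count the singular values needed to reach `threshold` of the variance.

    Assumptions (not confirmed by the source): rows are responses, columns are
    embedding dimensions, columns are mean-centered when `center` is True, and
    variance explained is proportional to the squared singular values.
    """
    X = np.asarray(embeddings, dtype=float)
    if X.ndim != 2:
        raise ValueError("embeddings must be a 2-D array (responses x dims)")
    if center:
        X = X - X.mean(axis=0, keepdims=True)

    # Singular values come back sorted from largest to smallest.
    s = np.linalg.svd(X, compute_uv=False)
    variance = s ** 2
    total = variance.sum()
    if total == 0.0:
        # All responses embed to the same point: no variance to explain.
        return 0

    cumulative = np.cumsum(variance) / total
    # Index of the first cumulative share that reaches the threshold, plus one.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(s))


if __name__ == "__main__":
    # Toy input: three "responses" lying on a single line in embedding space,
    # so one singular value explains all of the variance.
    toy = np.outer([-1.0, 0.0, 1.0], [1.0, 2.0, 3.0])
    print(response_dispersion(toy))  # prints 1
```

Under these assumptions, a model whose responses to the same opinion question vary along only a few directions gets a low count, and a model whose responses scatter across many directions gets a high count. In practice the rows would be the embeddings of responses generated with different random seeds.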

Paper Structure (21 sections, 1 figure)

This paper contains 21 sections, 1 figure.

Introduction
Background and Motivation
End-User Centric Assumptions
Procedure Overview
Paper structure
Response Dispersion
Motivating Hypothesis
Defining Response Dispersion Using Response Embeddings
Response Embeddings Used
Validation Methodology
Introducing the IRC-WikiTriviaQA Dataset
Prompting LLM responses to the IRC-Wiki Trivia questions
Grading the LLM responses to the IRC-Wiki Trivia questions
Response Dispersion Use-Case Analysis Defined
LLMs studied
...and 6 more sections

Figures (1)

Figure 1: Success % of choosing the best or good-enough model at different tolerance levels, averaged over all models and topic categories
