Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

Tristan Karch, Luca Engel, Philippe Schwaller, Frédéric Kaplan

TL;DR

The paper addresses how to pre-screen text collections for their potential to enhance LLM knowledge without retraining. It proposes an MCQ-based evaluation framework that measures information potential (IP) by comparing model performance with and without access to source content. The framework applies a two-stage quality filter and four-rotation testing to mitigate positional bias, and defines $\mathrm{IP} = \frac{C_{\text{context}}-C_{\text{direct}}}{|\mathcal{Q}|-(I_{\text{context}}+I_{\text{direct}})}$. Empirical results show higher IP for specialized corpora (e.g., EPFL PhD manuscripts, Venetian records) and lower IP for widely available content (Wikipedia) or synthetic baselines, with patterns consistent across open and closed models. This supports the method as a practical, dataset-agnostic tool for prioritizing the digitization and integration of new information sources for RAG systems or fine-tuning, enabling more efficient allocation of resources in data-centric AI.
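As a rough illustration of the IP formula above, the sketch below evaluates it from raw counts. This summary does not define the symbols, so the reading used here is an assumption: C_context and C_direct are taken as the numbers of questions answered correctly with and without the source text, |Q| as the total number of questions, and I_context and I_direct as counts of questions excluded from the denominator in each condition. The function name and the example numbers are hypothetical and do not come from the paper.

```python
def information_potential(c_context, c_direct, n_questions,
                          i_context=0, i_direct=0):
    """Evaluate IP = (C_context - C_direct) / (|Q| - (I_context + I_direct)).

    Assumed meanings (not defined in this summary):
      c_context, c_direct: questions answered correctly with / without context
      n_questions:         |Q|, the total number of questions
      i_context, i_direct: questions excluded from the denominator
    """
    denominator = n_questions - (i_context + i_direct)
    if denominator <= 0:
        raise ValueError("no questions left after exclusions")
    return (c_context - c_direct) / denominator


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    print(information_potential(c_context=60, c_direct=40, n_questions=100))
```

Under this reading, a larger gap between the two conditions gives a higher score, and exclusions shrink the denominator so the score is computed over the remaining questions only.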

Abstract

As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using five strategically selected datasets: EPFL PhD manuscripts, a private collection of Venetian historical records, two sets of Wikipedia articles on related topics, and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
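The TL;DR mentions four-rotation testing, and the Figure 1 caption counts a question as correct only when it is answered correctly in all 4 bias mitigation evaluations. A minimal sketch of that idea follows, assuming the four evaluations are cyclic rotations of a four-option answer list with distinct options. The `answer_fn` callback is a hypothetical stand-in for an LLM call that returns the index of the chosen option; none of these names come from the paper.

```python
def cyclic_rotations(options):
    """Return every cyclic rotation of the option list (4 for 4 options)."""
    return [options[k:] + options[:k] for k in range(len(options))]


def correct_in_all_rotations(options, correct_index, answer_fn):
    """True only if answer_fn picks the correct option in every rotation.

    answer_fn is a hypothetical callback: it receives the rotated option
    list and returns the index of the option it selects.
    Options are assumed to be distinct strings.
    """
    correct_text = options[correct_index]
    for rotated in cyclic_rotations(options):
        picked = answer_fn(rotated)
        if rotated[picked] != correct_text:
            return False
    return True


if __name__ == "__main__":
    opts = ["w", "x", "y", "z"]
    # A responder that always picks the first option is right in only
    # one rotation, so it fails the all-rotations criterion.
    print(correct_in_all_rotations(opts, 2, lambda rotated: 0))
```

Requiring agreement across all rotations means a responder that favors a fixed answer position cannot be credited for a question by chance placement of the correct option.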

Paper Structure

This paper contains 23 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Llama 3 70B Performance on EPFL Dataset Across Different Cutoff Percentages (with Separated Thresholding and Number of MCQs for the Corresponding Percentiles). NC 4x: Evaluation with no context, correct for all 4 bias mitigation evaluations of each question. WC 4x: Evaluation with context, correct for all 4 bias mitigation evaluations of each question. Questions Remaining: Fraction of the original number of questions still remaining for the given cutoff threshold. Cosine: Only cosine thresholding applied. Rouge-L and Jaccard: Only Rouge-L and Jaccard thresholding applied.
  • Figure 2: Information Potential Analysis Across Datasets and Models. Stacked barplot showing correct response overlap between context-free and context-provided conditions. Overlaying the barplot is a line plot showing the Information Potential (IP) scores. Higher IP scores indicate greater novel information content, with PhD manuscripts (EPFL) showing consistently higher IP (0.211-0.229) compared to Wikipedia (0.110-0.136) and synthetic baseline (0.125). Both open and closed-source models exhibit similar patterns despite architectural differences.
  • Figure 3: Distribution of the Correct Answer Among the Answer Options for the MCQ Dataset Generated with GPT-4o Before Positional Bias Mitigation.
  • Figure 4: Distribution of Correct Answer Letter Prediction for EPFL and Wikipedia MCQ Datasets Evaluated on GPT-4o After Positional Bias Mitigation.
  • Figure 5: Distribution of Correct Answer Letter Prediction for EPFL, Wikipedia, and Baseline MCQ Datasets Evaluated on Llama 3 70B After Positional Bias Mitigation.
  • ...and 2 more figures