Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

Tristan Karch; Luca Engel; Philippe Schwaller; Frédéric Kaplan

TL;DR

The paper addresses how to pre-screen text collections for their potential to add knowledge to an LLM, without training or fine-tuning a model. It proposes an MCQ-based evaluation framework that measures information potential (IP) by comparing model performance with and without access to the source content, using a two-stage quality filter and four-rotation testing to counter answer-position bias, with $IP = \frac{C_{\text{context}}-C_{\text{direct}}}{|\mathcal{Q}|-(I_{\text{context}}+I_{\text{direct}})}$. Empirical results show higher IP for specialized corpora (e.g., EPFL PhD manuscripts, Venetian records) and lower IP for widely available content (Wikipedia) or synthetic baselines, with patterns consistent across open and closed models. This supports the method as a practical, dataset-agnostic tool for prioritizing the digitization and integration of new information sources for RAG systems or fine-tuning, so that resources in data-centric AI go to the collections most likely to add new information.
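As a rough illustration of the IP formula above, the following Python sketch evaluates it from raw counts. The function name and argument names are hypothetical, and since this summary does not spell out what the $C$ and $I$ terms count, the sketch simply treats them as integers supplied by the caller and mirrors the formula term by term.

```python
def information_potential(c_context: int, c_direct: int,
                          i_context: int, i_direct: int,
                          n_questions: int) -> float:
    """Evaluate IP = (C_context - C_direct) / (|Q| - (I_context + I_direct)).

    All arguments are plain counts; their exact definitions are assumed
    to come from the paper, not from this sketch.
    """
    denominator = n_questions - (i_context + i_direct)
    if denominator <= 0:
        # The formula is undefined when the denominator vanishes.
        raise ValueError("|Q| must exceed I_context + I_direct")
    return (c_context - c_direct) / denominator


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(information_potential(80, 50, 5, 5, 110))
```

A larger gap between the with-context and direct counts gives a larger IP, which matches the reading of IP as a performance gap between the two conditions.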

Abstract

As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using five strategically selected datasets: EPFL PhD manuscripts, a private collection of Venetian historical records, two sets of Wikipedia articles on related topics, and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
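The with-and-without-context comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: `answer_fn` is a hypothetical stand-in for an LLM call, the question dictionary layout is invented for illustration, and scoring each rotation of the answer options as a separate trial is an assumption about how the four rotations are aggregated.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model interface: (question, options, context or None) -> chosen index.
Answerer = Callable[[str, Sequence[str], Optional[str]], int]


def rotations(options: Sequence[str]) -> list:
    """Return every cyclic rotation of the options (four for a 4-option MCQ)."""
    return [list(options[k:]) + list(options[:k]) for k in range(len(options))]


def count_correct(questions: Sequence[dict], answer_fn: Answerer,
                  use_context: bool) -> int:
    """Count correct answers over all questions and all option rotations."""
    correct = 0
    for q in questions:
        context = q["source"] if use_context else None
        for rotated in rotations(q["options"]):
            picked = answer_fn(q["question"], rotated, context)
            if rotated[picked] == q["answer"]:
                correct += 1
    return correct


def performance_gap(questions: Sequence[dict], answer_fn: Answerer) -> int:
    """Correct count with source access minus correct count without it."""
    return (count_correct(questions, answer_fn, True)
            - count_correct(questions, answer_fn, False))
```

Rotating the options means a model that always picks the same position cannot score well by chance, so the gap reflects what the source text adds rather than a positional habit.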
