Greedy Information Projection for LLM Data Selection

Victor Ye Dong; Kuan-Yun Lee; Jiamei Shuai; Shengfei Liu; Yi Liu; Jian Jiao

Abstract

We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing the mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework optimizes a closed-form mutual information objective defined on both data and query embeddings, naturally balancing quality and diversity. Optimizing this objective is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, which provides a geometric explanation for the co-emergence of quality and diversity. Building on this view, we employ a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and mathematical reasoning datasets, GIP selects small subsets that match full-data fine-tuning while using only a fraction of the examples and compute, unifying quality-aware and diversity-aware selection for efficient fine-tuning.
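The projection view described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm or its mutual information objective: it is a generic orthogonal-matching-pursuit-style loop that greedily grows the span of selected data embeddings to capture as much of the query embeddings' energy as possible. The function name `greedy_projection_select`, the parameters `k` and `tol`, and the assumption that rows of `F` (data) and `Q` (queries) live in the same embedding space are all illustrative choices.

```python
import numpy as np

def greedy_projection_select(F, Q, k, tol=1e-10):
    """Greedily pick up to k rows of F whose span captures the rows of Q.

    Illustrative sketch only (assumed setup, not the paper's exact method):
    F is (n, d) data embeddings, Q is (m, d) query embeddings.
    """
    R = np.asarray(F, dtype=float).copy()   # data residuals w.r.t. selected span
    Qr = np.asarray(Q, dtype=float).copy()  # query residuals w.r.t. selected span
    selected = []
    for _ in range(k):
        norms2 = np.einsum("ij,ij->i", R, R)       # squared residual norms
        energy = ((R @ Qr.T) ** 2).sum(axis=1)     # query energy along each residual
        # Gain = query energy captured by the unit direction of each residual.
        gains = np.where(norms2 > tol, energy / np.maximum(norms2, tol), -np.inf)
        i = int(np.argmax(gains))
        if not np.isfinite(gains[i]) or gains[i] <= tol:
            break                                  # nothing left to capture
        u = R[i] / np.sqrt(norms2[i])              # new orthonormal direction
        selected.append(i)
        R -= np.outer(R @ u, u)                    # deflate data residuals
        Qr -= np.outer(Qr @ u, u)                  # deflate query residuals
    return selected

# Toy example: row 1 nearly duplicates row 0, row 3 is irrelevant to Q.
F = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Q = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
picked = greedy_projection_select(F, Q, k=2)
```

In this toy setup the near-duplicate row contributes almost nothing once row 0 is selected, because the deflation step removes the direction it shares with row 0. That is the sense in which a projection objective rewards both alignment with the queries (quality) and new directions (diversity).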

Paper Structure (53 sections, 5 theorems, 37 equations, 2 figures, 23 tables, 1 algorithm)

Introduction
This work.
Related Work
Data Curation for Large-Scale Language Models
Information-Theoretic Objectives in Selection and Clustering
Active Learning and Coreset Selection
Problem formulation
Mutual Information Formulation
Greedy approximation algorithm
Greedy matching pursuit (MP)
Analysis of relaxation.
Computational Complexity and Practical Costs
Experiments
Datasets and baseline models
Implementation
...and 38 more sections

Key Result

Theorem 1

Maximizing mutual information defined in equation eq:mutual-inf-def is equivalent to optimizing

Figures (2)

Figure 1: Left: GSM8K performance comparison on Qwen3-8B (left panel) and Mistral-7B (right panel) across different training data percentages (2.5%, 5%, 10%, 20%). Our proposed methods MP+MA and MP+SC are competitive with strong baselines (Random, DSIR, DISF, LESS), and often improve over Random/DISF/LESS. In several settings, they approach the full-dataset (100%, shown as a horizontal reference line) performance at 10%--20% of training data, demonstrating strong data efficiency. Right: Geometric interpretation of GIP. The method maximizes mutual information between Gaussian projections induced by the data embedding matrix $F$ and score embedding matrix $Q$. This is equivalent to minimizing the volume (determinant) of score embeddings projected onto the null space of selected data, naturally balancing quality (high-score items) and diversity (new directions in embedding space).
Figure 2: MT-Bench per-category average scores on Mistral-7B under cleaned vs non-cleaned Alpaca, comparing only MP+SC (1% of data) against the Full-data baseline. Scores are computed from GPT-5 judgments.
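The geometric reading in Figure 1 (right) rests on a standard fact: the energy of the query embeddings splits exactly into a part inside the span of the selected data and a part in its orthogonal complement, so growing one shrinks the other. The sketch below checks this split numerically in terms of squared Frobenius norms. It does not reproduce the determinant-based volume from the caption, and the name `split_energy` and the random test matrices are assumptions made for illustration.

```python
import numpy as np

def split_energy(F_sel, Q):
    """Return (energy of Q inside span of F_sel rows, energy outside it).

    Illustrative sketch: F_sel is (k, d) selected data embeddings,
    Q is (m, d) query/score embeddings in the same space (an assumption).
    """
    d = F_sel.shape[1]
    B, _ = np.linalg.qr(F_sel.T)        # orthonormal basis of the selected span
    P = B @ B.T                         # orthogonal projector onto that span
    in_span = np.linalg.norm(Q @ P) ** 2
    in_null = np.linalg.norm(Q @ (np.eye(d) - P)) ** 2
    return in_span, in_null

rng = np.random.default_rng(0)
F_sel = rng.normal(size=(3, 8))         # 3 selected examples in 8 dimensions
Q = rng.normal(size=(5, 8))             # 5 query vectors
in_span, in_null = split_energy(F_sel, Q)
total = np.linalg.norm(Q) ** 2          # equals in_span + in_null up to rounding
```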

Theorems & Definitions (11)

Remark : Necessity of $Q$
Remark : Gaussianity as a modeling device
Theorem 1
Theorem 2
Remark
Theorem 3
proof
Lemma 4: Matrix Determinant Lemma (woodbury1950inverting)
Theorem 5
proof
...and 1 more
