Greedy Information Projection for LLM Data Selection

Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu, Jian Jiao

Abstract

We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing quality and diversity. Optimizing this score is equivalent to maximizing the projection of the query embedding matrix onto the span of the selected data, which provides a geometric explanation for the co-emergence of quality and diversity. Building on this view, we employ a fast greedy matching-pursuit procedure with efficient projection-based updates. On instruction-following and mathematical reasoning datasets, GIP selects small subsets that match full-data fine-tuning while using only a fraction of examples and compute, unifying quality-aware and diversity-aware selection for efficient fine-tuning.
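
The greedy projection idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that data embeddings `F` (one row per example) and query embeddings `Q` (one row per query signal) live in the same embedding space, and it uses captured squared Frobenius norm as a simple stand-in for the paper's closed-form mutual information objective. The function name `greedy_projection_select` and the `tol` parameter are hypothetical.

```python
import numpy as np


def greedy_projection_select(F, Q, k, tol=1e-10):
    """Greedily pick up to k rows of F whose span captures the most of Q.

    Assumption (not from the paper): the score is the squared Frobenius
    norm of Q projected onto the span of the selected rows of F.
    """
    R_F = np.array(F, dtype=float)  # data residuals, orthogonal to the selected span
    R_Q = np.array(Q, dtype=float)  # query residuals, orthogonal to the selected span
    n = R_F.shape[0]
    chosen = np.zeros(n, dtype=bool)
    selected = []

    for _ in range(min(k, n)):
        norms_sq = np.einsum("ij,ij->i", R_F, R_F)
        # Query energy captured by each candidate's residual direction.
        captured = ((R_F @ R_Q.T) ** 2).sum(axis=1)
        gain = np.where(norms_sq > tol, captured / np.maximum(norms_sq, tol), 0.0)
        gain[chosen] = -np.inf  # never pick the same example twice

        j = int(np.argmax(gain))
        if gain[j] <= tol:
            break  # nothing left in Q that the remaining examples can explain

        # Projection-based update: remove the new direction from all residuals.
        u = R_F[j] / np.sqrt(norms_sq[j])
        R_F -= np.outer(R_F @ u, u)
        R_Q -= np.outer(R_Q @ u, u)
        chosen[j] = True
        selected.append(j)

    return selected


# Toy data: example 1 nearly duplicates example 0, example 3 is irrelevant to Q.
F = np.array([[1, 0, 0], [1, 0, 0.5], [0, 1, 0], [0, 0, 1]], dtype=float)
Q = np.array([[2, 0, 0], [0, 1, 0]], dtype=float)
print(greedy_projection_select(F, Q, k=2))  # → [0, 2]
```

In this toy run the near-duplicate (example 1) contributes nothing once example 0 is selected, and the off-topic example 3 never scores. That is the quality and diversity trade-off the abstract describes, here in a simplified form.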

Paper Structure (53 sections, 5 theorems, 37 equations, 2 figures, 23 tables, 1 algorithm)

Key Result

Theorem 1

Maximizing the mutual information defined in equation (eq:mutual-inf-def) is equivalent to optimizing

Figures (2)

  • Figure 1: Left: GSM8K performance comparison on Qwen3-8B (left panel) and Mistral-7B (right panel) across different training data percentages (2.5%, 5%, 10%, 20%). Our proposed methods MP+MA and MP+SC are competitive with strong baselines (Random, DSIR, DISF, LESS), and often improve over Random/DISF/LESS. In several settings, they approach the full-dataset (100%, shown as a horizontal reference line) performance at 10% to 20% of training data, demonstrating strong data efficiency. Right: Geometric interpretation of GIP. The method maximizes mutual information between Gaussian projections induced by the data embedding matrix $F$ and score embedding matrix $Q$. This is equivalent to minimizing the volume (determinant) of score embeddings projected onto the null space of selected data, naturally balancing quality (high-score items) and diversity (new directions in embedding space).
  • Figure 2: MT-Bench per-category average scores on Mistral-7B under cleaned vs non-cleaned Alpaca, comparing only MP+SC (1% of data) against the Full-data baseline. Scores are computed from GPT-5 judgments.
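
The geometric reading in the Figure 1 caption rests on a complementarity: whatever part of $Q$ lies in the span of the selected data is missing from its orthogonal complement, so growing one shrinks the other. The sketch below checks this numerically. It is a simplified illustration that measures size by squared Frobenius norm, not by the determinant-based volume the caption refers to. The helper name `split_query_energy` is hypothetical, and rows of `F_sel` and `Q` are assumed to share one embedding space.

```python
import numpy as np


def split_query_energy(F_sel, Q):
    """Return (energy of Q inside span(rows of F_sel), energy outside it)."""
    F_sel = np.asarray(F_sel, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Orthonormal basis for the row space of F_sel; drop numerically null directions.
    _, s, Vt = np.linalg.svd(F_sel, full_matrices=False)
    basis = Vt[s > 1e-10 * s.max()]
    Q_in = Q @ basis.T @ basis  # projection onto the selected span
    Q_out = Q - Q_in            # residual in the orthogonal complement
    return float((Q_in ** 2).sum()), float((Q_out ** 2).sum())


Q = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
inside, outside = split_query_energy(np.array([[1.0, 0.0, 0.0]]), Q)
# The two parts always sum to the total energy of Q.
assert np.isclose(inside + outside, (Q ** 2).sum())
```

Adding a selected example that points in a new, query-relevant direction moves energy from the second number to the first. An example that repeats an already covered direction changes neither.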

Theorems & Definitions (11)

  • Remark : Necessity of $Q$
  • Remark : Gaussianity as a modeling device
  • Theorem 1
  • Theorem 2
  • Remark
  • Theorem 3
  • proof
  • Lemma 4: Matrix Determinant Lemma (Woodbury, 1950)
  • Theorem 5
  • proof
  • ...and 1 more
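
For reference, the matrix determinant lemma named in Lemma 4 is usually stated as below. This is the standard textbook form. The symbols $A$, $U$, $V$, $u$, $v$ are generic and not taken from the paper, and this listing does not show how the paper applies the lemma.

```latex
% General form: A is an invertible n x n matrix, U and V are n x k.
\det\!\left(A + U V^{\top}\right)
  = \det(A)\,\det\!\left(I_k + V^{\top} A^{-1} U\right)

% Rank-one special case (k = 1), with column vectors u and v:
\det\!\left(A + u v^{\top}\right)
  = \left(1 + v^{\top} A^{-1} u\right)\det(A)
```

The rank-one case shows why such identities pair well with greedy selection in general: the determinant after adding one vector follows from a scalar correction to the previous determinant, with no need to refactorize.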