GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency

Changhao Wang, Jiaolong Yang, Xinhao Yao, Yunfei Yu, Peng Jiao, Lu Yu, Junpeng Fang, Riccardo Cantoro, Qing Cui, Jun Zhou

TL;DR

GRIP (Geometric Refinement and Adaptive Information Potential) is a data-curation framework for large-scale pre-training that unifies global distribution balancing and local instance selection by modeling the corpus as an information-dense geometric space.

Abstract

The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity of the training set. We introduce GRIP (Geometric Refinement and Adaptive Information Potential), a framework that unifies these dimensions by modeling the corpus as an information-dense geometric space. GRIP employs a Rapid Adaptation Probe (RAP) to quantify the information potential of semantic clusters, dynamically re-allocating the sampling budget to regions with the highest representation deficits. Subsequently, we perform Intra-Cluster Selection using a length-rectified geometric prior to counteract embedding density artifacts and preserve long-tail logical sequences. Extensive evaluations on Mixture-of-Experts (MoE) models up to 300B tokens demonstrate that GRIP consistently outperforms state-of-the-art baselines, surpassing the performance of models trained on $3\times$ larger uncurated datasets. Our work establishes a robust geometric foundation for adaptive data curation in large-scale pre-training.

Paper Structure

This paper contains 35 sections, 10 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Overview of the GRIP Framework. GRIP unifies Inter-Cluster Budgeting and Intra-Cluster Selection through a hierarchical geometric optimization: (1) Geometric Probing: We partition the corpus into semantic clusters and construct a Neyman-optimal probe set $\mathcal{P}$ based on Geometric Consistency ($\sigma_k$) to estimate the baseline budget $n_k^{base}$. (2) Dynamic Allocation: By monitoring the Adaptation Delta ($\Delta \mathcal{L}_k$), GRIP identifies representation deficits and dynamically re-allocates resources from saturated regions to high-potential clusters via a replay multiplier $r_k$. (3) Rectified Sampling: Within clusters, we employ density-based selection with a Length-Rectification Term to counteract embedding collapse, preserving both global structural variance and long-tail logical sequences (Sorscher et al., 2022; Ethayarajh, 2019).
  • Figure 2: Analysis of Sequence Length Pathology. (Left) The normalized distance to the $k$-nearest neighbors ($k=10$) decreases rapidly as sequence length grows, indicating that embeddings of long sequences collapse into a narrow, dense region (anisotropic cone). (Right) The sample ratio reveals a severe heavy-tailed imbalance: data density vanishes for long sequences (power-law tail), leading to insufficient supervision for extended contexts.
  • Figure 3: Mechanism of the Rapid Adaptation Probe. To evaluate the learnability of different clusters (e.g., $C_0, C_1, C_2$), we freeze the lower layers and reset the Retraining Layers to a consistent initialization. We then perform $N$-step gradient descent independently for each cluster. The resulting Adaptation Delta ($\Delta \mathcal{L}_k$) measures how quickly the loss drops from this common starting point. A rapid loss reduction (e.g., $C_0$) indicates that the data is easily predictable given the current features, implying low incremental information gain, while a small drop (e.g., $C_2$) indicates a learning bottleneck requiring increased replay budget.
  • Figure 4: Independence of Static Quality and Training Dynamics. (Left) Scatter plot where each data point represents a cluster, showing a weak correlation (Pearson $\approx -0.202$) between LLM-based quality scores $Q_k$ and the adaptation delta $\Delta \mathcal{L}_k$. (Right) Density distributions across four dimensions illustrate that high-value semantic features exhibit distinct spectral signatures.
  • Figure 5: Cross-Model Transferability of Loss Dynamics. Loss trajectories across different model families (Qwen-2.5 vs. SmolLM) and reset depths ($N \in \{0, 1, 2, 4\}$). The high consistency in ranking suggests that lightweight proxy models can effectively guide data selection for larger target architectures.
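
Two quantities named in the captions above can be illustrated with a short sketch: the Neyman-optimal split of a sampling budget across clusters (Figure 1, step 1) and the distance to the $k$ nearest neighbors used to diagnose embedding collapse (Figure 2). This is not the authors' implementation. The function names are hypothetical, $\sigma_k$ is assumed to act as a per-cluster standard deviation in the textbook Neyman rule $n_k \propto N_k \sigma_k$, and the distance is plain Euclidean without the normalization the paper applies. The replay multiplier $r_k$ and the Length-Rectification Term are not reproduced because the captions do not specify them.

```python
import numpy as np


def neyman_allocation(cluster_sizes, cluster_stds, budget):
    """Textbook Neyman allocation: n_k = budget * N_k * sigma_k / sum_j N_j * sigma_j.

    Assumption: sigma_k is treated as a per-cluster standard deviation.
    Returns fractional allocations; rounding is left to the caller.
    """
    sizes = np.asarray(cluster_sizes, dtype=float)
    stds = np.asarray(cluster_stds, dtype=float)
    weights = sizes * stds
    total = weights.sum()
    if total == 0.0:
        # Degenerate case (no spread anywhere): fall back to proportional allocation.
        weights = sizes
        total = weights.sum()
    return budget * weights / total


def mean_knn_distance(embeddings, k=10):
    """Mean Euclidean distance from each row to its k nearest other rows.

    Assumption: unnormalized Euclidean distance on raw embedding vectors.
    """
    x = np.asarray(embeddings, dtype=float)
    if k >= x.shape[0]:
        raise ValueError("k must be smaller than the number of embeddings")
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (x * x).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)    # clip tiny negatives caused by rounding
    np.fill_diagonal(d2, np.inf)   # exclude each point from its own neighbors
    nearest = np.sort(d2, axis=1)[:, :k]
    return np.sqrt(nearest).mean(axis=1)
```

Under these assumptions, a cluster with a larger spread receives a larger share of the budget at equal size, and a group of embeddings whose mean $k$-NN distance is small relative to the rest of the corpus sits in a dense region of the kind Figure 2 reports for long sequences.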