Data Proportion Detection for Optimized Data Management for Large Language Models

Hao Liang; Keshi Zhao; Yajie Yang; Bin Cui; Guosheng Dong; Zenan Zhou; Wentao Zhang

TL;DR

This paper introduces a new topic, data proportion detection: automatically estimating the pre-training data proportions of an LLM by analyzing its generated outputs. It provides theoretical proofs, practical algorithms, and preliminary experimental results for the task.

Abstract

Large language models (LLMs) have demonstrated exceptional performance across a wide range of tasks and domains, with data preparation playing a critical role in achieving these results. Pre-training data typically combines information from multiple domains. To maximize performance when integrating data from various domains, determining the optimal data proportion is essential. However, state-of-the-art (SOTA) LLMs rarely disclose details about their pre-training data, making it difficult for researchers to identify ideal data proportions. In this paper, we introduce a new topic, data proportion detection, which enables the automatic estimation of pre-training data proportions by analyzing the generated outputs of LLMs. We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection. Based on these findings, we offer valuable insights into the challenges and future directions for effective data proportion detection and data management.
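As a rough illustration of the idea in the abstract, the sketch below estimates domain proportions by labeling generated samples with a domain classifier and normalizing the counts. It is a minimal sketch under stated assumptions, not the paper's algorithm: the function names `estimate_domain_proportions` and `toy_classifier`, the keyword rules, and the domain names `code`, `math`, and `web` are all hypothetical. In practice the samples would come from the LLM under study and the classifier would be a trained model.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def estimate_domain_proportions(
    samples: Iterable[str],
    classify: Callable[[str], str],
    domains: List[str],
) -> Dict[str, float]:
    """Return the fraction of samples assigned to each domain.

    Samples whose predicted label is not in `domains` are ignored.
    """
    counts = Counter(classify(text) for text in samples)
    total = sum(counts[d] for d in domains)
    if total == 0:
        raise ValueError("no sample was assigned to any known domain")
    return {d: counts[d] / total for d in domains}


def toy_classifier(text: str) -> str:
    """Hypothetical keyword-based stand-in for a real domain classifier."""
    if "def " in text or "import " in text:
        return "code"
    if "=" in text and any(ch.isdigit() for ch in text):
        return "math"
    return "web"


if __name__ == "__main__":
    # Stand-ins for text generated by an LLM; real samples would be model outputs.
    generated = [
        "def f(x): return x",
        "import os",
        "2 + 2 = 4",
        "The weather is nice today.",
    ]
    print(estimate_domain_proportions(generated, toy_classifier, ["code", "math", "web"]))
```

The raw frequencies are only a first approximation of the generated-output distribution, and their accuracy depends on the classifier and on the quality of the generated text.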

Paper Structure (29 sections, 3 theorems, 11 equations, 3 figures, 1 algorithm)

Introduction
Preliminary
Pre-training Stage in LLMs
Data Domain Proportioning in LLMs
Data Preparation for LLMs
Data Preparation Systems
Data Quality
Data Cleaning
Problem Formulation
Data Proportion Detection
Theory of Data Proportion Detection
A Preliminary Algorithm for Data Proportion Detection
Preliminary Experiments
Experiment Setting
Base Model
...and 14 more sections

Key Result

Proposition 1

Let $y$ be a generated sentence in domain $D_i$. The probability that $y$ belongs to domain $D_i$ can be expressed as:

Figures (3)

Figure 1: Training data proportion
Figure 2: Detect data proportion
Figure 3: MAP-NEO synthetic data from <bos>. The data can be low quality and pose challenges for classification models, so strong classification models and data cleaning are needed.

Theorems & Definitions (3)

Proposition 1: Probability of a Sentence Belonging to a Domain
Theorem 1: Data Mixing Law
Proposition 2: Data Proportion Detection
