General Information Metrics for Improving AI Model Training Efficiency

Jianfeng Xu, Congcong Liu, Xiaoying Tan, Xiaojie Zhu, Anpeng Wu, Huan Wan, Weijun Kong, Chun Li, Hu Xu, Kun Kuang, Fei Wu

TL;DR

This work introduces General Information Metrics Evaluation (GIME), a pre-training data selection framework grounded in Objective Information Theory that uses 11 metrics to quantify training data quality. By thresholding high-sensitivity metrics before model training, GIME probabilistically yields datasets that preserve performance while dramatically reducing data size, training time, and energy use. Across CTR Prediction, Civil Case Prediction, Weather Forecasting, and a Judicial AI program, GIME demonstrates substantial efficiency gains with minimal performance degradation, and outperforms baseline full-data and random-sampling approaches, as well as an active-learning baseline. The proposed approach offers a domain-agnostic, theory-backed path toward sustainable and scalable AI development, with clear practical impact in large-scale and resource-constrained settings.

Abstract

To address the growing size of AI model training data and the lack of a universal data selection methodology, two factors that significantly drive up training costs, this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), namely volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch, to optimize dataset selection for training purposes. Comprehensive experiments conducted across diverse domains, such as CTR Prediction, Civil Case Prediction, and Weather Forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and costs. Additionally, applying GIME within the Judicial AI Program led to a 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.

Paper Structure (15 sections, 2 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of General Information Metrics Supporting AI Model Training Data Selection. (a) Traditional AI model training often lacks effective methods for selecting optimal data, leading to significant resource costs. (b) By employing 11 general information metrics in the GIME algorithm, i.e., volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch, we can select AI training datasets and substantially reduce resource consumption during the training process. (c) This paper presents experiments that apply general information metrics to support AI model training in three scenario tasks and a judicial program: (I) Click-Through Rate (CTR) Prediction, predicting the likelihood of a user clicking on a specific item, a key task in online advertising and recommendation systems. (II) Civil Case Prediction, using civil case data to forecast case types, supporting judges, lawyers, litigants, and researchers. (III) Weather Forecasting, predicting future weather conditions using meteorological data from multiple cities, a long-standing AI application. (IV) Judicial AI Program, training models for cause-of-action prediction, case feature recognition, event extraction, judgment outcome reasoning, legal article recommendation, and judgment reasoning generation.
  • Figure 2: Work Pipeline of General Information Metrics Evaluation (GIME) Framework. The light gray shaded area represents the GIME framework, which comprises four modules. GIME selects data from the training data pool, and once the information metrics of the chosen dataset meet the threshold criteria, the AI model training process is initiated. Subsequently, the model’s performance is evaluated and analyzed using a test dataset.
  • Figure 3: Three Deep Models for Three Tasks: DNNs for CTR Prediction, ERNIE Model for Civil Case Prediction, and Time-Series Model for Weather Forecasting.
  • Figure 4: Correlation Experiments Between Training Dataset Information Metrics and AI Model Performance across Three Tasks. (I) CTR Prediction Experiment: Correlation experiments measured AUC against nine metrics, i.e., volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, and mismatch. Note that in the mismatch experiments the horizontal axis shows the training data’s CTR rather than the mismatch itself, which is defined as |CTR - 0.5|, for more intuitive visualization. (II) Civil Case Prediction Experiment: Correlation experiments measured Accuracy against nine metrics, i.e., volume, delay, scope, granularity, variety, aggregation, coverage, distortion, and mismatch. (III) Weather Forecasting Experiment: Correlation experiments measured MRMSE against nine metrics, i.e., volume, delay, scope, granularity, variety, duration, sampling rate, distortion, and mismatch.
  • Figure 5: Performance Comparison using GIME-based Dataset Selection across Three Distinct Tasks. (a) CTR Prediction: “Full” indicates models trained on the entire dataset, while S1–S10 denote GIME-selected subsets. (b) Civil Case Prediction: “Full” denotes full dataset training; S1–S10 are GIME subsets. (c) Weather Forecasting: Results for “Full,” GIME, and random sampling across CNN, LSTM, GRU, and Autoformer. In the CTR Prediction and Civil Case Prediction tasks, the dashed lines represent the average results, while the shaded areas denote the standard deviation.
  • ...and 2 more figures
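
The selection loop described in the Figure 2 caption (draw data from the pool, check the information metrics against thresholds, and only then start training) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The only metric definition taken from the text is mismatch = |CTR - 0.5| from the Figure 4 caption; treating volume as a plain sample count, the function names `mismatch`, `meets_thresholds`, and `select_subset`, the random candidate draw, and all threshold values are assumptions made for illustration.

```python
import random


def mismatch(labels):
    """Mismatch as in the Figure 4 caption: |CTR - 0.5|, where CTR is the
    fraction of positive (clicked) labels in the candidate training set."""
    ctr = sum(labels) / len(labels)
    return abs(ctr - 0.5)


def meets_thresholds(labels, min_volume, max_mismatch):
    """Hypothetical threshold gate. Volume is assumed here to be the
    sample count, which may differ from the OIT definition."""
    return len(labels) >= min_volume and mismatch(labels) <= max_mismatch


def select_subset(pool, size, min_volume, max_mismatch, max_tries=100, seed=0):
    """Draw random candidate subsets from the pool until one passes the gate.
    Returns the subset, or None if no candidate passes within max_tries."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = rng.sample(pool, size)
        if meets_thresholds(candidate, min_volume, max_mismatch):
            return candidate
    return None


if __name__ == "__main__":
    # Toy pool of binary click labels with an overall CTR of 0.5.
    pool = [0, 1] * 500
    subset = select_subset(pool, size=100, min_volume=100, max_mismatch=0.1)
    if subset is not None:
        print("selected", len(subset), "samples, mismatch =", mismatch(subset))
```

In this sketch training would start only once `select_subset` returns a subset. A real pipeline would evaluate whichever of the 11 metrics prove most sensitive for the task rather than the two shown here.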