Fill In The Gaps: Model Calibration and Generalization with Synthetic Data

Yang Ba, Michelle V. Mancenido, Rong Pan

TL;DR

We propose a calibration method that incorporates synthetic data without compromising accuracy, and we derive a bound on the expected calibration error (ECE) using the Probably Approximately Correct (PAC) learning framework.

Abstract

As machine learning models continue to advance rapidly, calibrating their performance has become a major concern prior to practical and widespread deployment. Many existing calibration methods reduce model accuracy because the validation data they rely on lacks diversity, which in turn limits generalizability. To address this, we propose a calibration method that incorporates synthetic data without compromising accuracy. We derive a bound on the expected calibration error (ECE) using the Probably Approximately Correct (PAC) learning framework. Large language models (LLMs), known for their ability to mimic real data and generate text with mixed class labels, serve as the synthetic data generator, which lowers the ECE bound and improves model accuracy on real test data. Additionally, we propose data generation mechanisms for efficient calibration. Testing our method on four different natural language processing tasks, we observed average improvements of up to a 34% increase in accuracy and a 33% decrease in ECE.

Paper Structure

This paper contains 3 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Generating synthetic data to address miscalibration gaps. In (a), the target bin for calibration is identified as Low Probability & Overconfidence. Synthetic data is generated away from the decision boundary in (b).
  • Figure 2: The iterative process of enhancing the accuracy and calibration of a 1D logistic regression model is demonstrated. Initially, the model is fitted using observed data (a), followed by the creation of its reliability diagram to identify poorly calibrated bins (b). Next, synthetic data points are strategically added to two targeted bins and the model is refitted. This iterative approach results in the model closely approximating the true logistic curve (c), thereby improving the calibration in the reliability diagram (d).
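
The reliability-diagram step described in Figure 2 (b), where poorly calibrated bins are identified, can be sketched as below. This is a minimal illustration and not the authors' implementation. It assumes the commonly used equal-width binned ECE, in which each bin's gap between accuracy and mean confidence is weighted by the fraction of samples in that bin. The function names `reliability_bins`, `expected_calibration_error`, and `target_bins`, as well as the parameters `n_bins` and `k`, are hypothetical and chosen only for this example. Synthetic data generation itself is not covered.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins on [0, 1].

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the corresponding prediction was right, else 0
    Returns one (count, mean_confidence, accuracy) tuple per bin.
    """
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of conf, sum of correct
    for p, c in zip(confidences, correct):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        sums[b][0] += 1
        sums[b][1] += p
        sums[b][2] += c
    return [(n, s / n, a / n) if n else (0, 0.0, 0.0) for n, s, a in sums]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted average of |accuracy - confidence| over bins."""
    bins = reliability_bins(confidences, correct, n_bins)
    total = sum(n for n, _, _ in bins)
    return sum(n / total * abs(acc - conf) for n, conf, acc in bins if n)


def target_bins(confidences, correct, n_bins=10, k=2):
    """Return the k non-empty bins with the largest calibration gap.

    Each entry is (bin_index, "overconfident" | "underconfident"), a rough
    stand-in for choosing which bins to fill with synthetic data.
    """
    bins = reliability_bins(confidences, correct, n_bins)
    gaps = [
        (abs(acc - conf), b, "overconfident" if conf > acc else "underconfident")
        for b, (n, conf, acc) in enumerate(bins)
        if n
    ]
    gaps.sort(reverse=True)  # largest gap first
    return [(b, label) for _, b, label in gaps[:k]]
```

In a workflow like the one in Figure 2, a model would be refitted after adding synthetic points to the bins returned by `target_bins`, and `expected_calibration_error` would be recomputed to check whether the gap has shrunk.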