Downsized and Compromised?: Assessing the Faithfulness of Model Compression

Moumita Kamal, Douglas A. Talbert

TL;DR

This paper argues that evaluating compressed models should go beyond size and accuracy to include faithfulness to the original model, especially in high-stakes domains. It introduces a faithfulness framework based on model agreement and chi-squared tests to detect non-random shifts in predictions and bias, evaluated across three socially meaningful datasets using quantization and pruning. The findings show that high accuracy does not guarantee faithful behavior; quantization generally preserves faithfulness better than pruning, though subtle subgroup-specific shifts can remain. The work provides a practical diagnostic toolkit for deploying efficient, trustworthy AI in resource-constrained environments and lays the groundwork for broader use of agreement-and-bias-based metrics in model compression workflows.

Abstract

In real-world applications, computational constraints often require transforming large models into smaller, more efficient versions through model compression. While these techniques aim to reduce size and computational cost without sacrificing performance, their evaluations have traditionally focused on the trade-off between size and accuracy, overlooking the aspect of model faithfulness. This limited view is insufficient for high-stakes domains like healthcare, finance, and criminal justice, where compressed models must remain faithful to the behavior of their original counterparts. This paper presents a novel approach to evaluating faithfulness in compressed models, moving beyond standard metrics. We introduce and demonstrate a set of faithfulness metrics that capture how model behavior changes post-compression. Our contributions include introducing techniques to assess predictive consistency between the original and compressed models using model agreement, and applying chi-squared tests to detect statistically significant changes in predictive patterns across both the overall dataset and demographic subgroups, thereby exposing shifts that aggregate fairness metrics may obscure. We demonstrate our approaches by applying quantization and pruning to artificial neural networks (ANNs) trained on three diverse and socially meaningful datasets. Our findings show that high accuracy does not guarantee faithfulness, and our statistical tests detect subtle yet significant shifts that are missed by standard metrics, such as Accuracy and Equalized Odds. The proposed metrics provide a practical and more direct method for ensuring that efficiency gains through compression do not compromise the fairness or faithfulness essential for trustworthy AI.
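The two checks named in the abstract, model agreement and a chi-squared test on predictive patterns, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `agreement_rate` and `chi2_binary_shift` are hypothetical, labels are assumed to be binary (0/1), and the test is assumed to compare the two models' predicted-label counts in a 2x2 table. The abstract does not spell out how the paper constructs its test.

```python
import math
from collections import Counter


def agreement_rate(base_preds, comp_preds):
    """Fraction of inputs on which both models predict the same label."""
    if len(base_preds) != len(comp_preds):
        raise ValueError("prediction lists must have equal length")
    matches = sum(b == c for b, c in zip(base_preds, comp_preds))
    return matches / len(base_preds)


def chi2_binary_shift(base_preds, comp_preds):
    """Pearson chi-squared test on a 2x2 table (model x predicted label).

    Assumes binary labels 0/1, so the table has 1 degree of freedom.
    Returns (statistic, p_value).
    """
    counts = [Counter(base_preds), Counter(comp_preds)]
    observed = [[c[0], c[1]] for c in counts]
    row_tot = [sum(row) for row in observed]
    col_tot = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_tot)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / total
            if expected > 0:  # skip empty columns to avoid division by zero
                stat += (observed[i][j] - expected) ** 2 / expected

    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Toy predictions (made up for illustration, not data from the paper).
    baseline = [0] * 50 + [1] * 50
    compressed = [0] * 80 + [1] * 20
    print("agreement:", agreement_rate(baseline, compressed))
    print("chi2, p:", chi2_binary_shift(baseline, compressed))
```

To look for subgroup-specific shifts, the same two functions could be applied to the predictions restricted to each demographic subgroup. The two checks are complementary in this sketch: two models can predict each class equally often, so the chi-squared statistic is zero, while still disagreeing on individual inputs.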

Paper Structure

This paper contains 43 sections, 2 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Difference in model size per dataset.
  • Figure 2: Change in validation and test accuracy over 10 iterations on the COMPAS dataset for Baseline, Quantized, and Pruned models.
  • Figure 3: Average model accuracy per dataset.
  • Figure 4: Agreement Statistic Matrix - COMPAS (Baseline vs Quantized model).
  • Figure 5: Number of times p-values were above/below threshold for each compressed model across all datasets. Note: Points below the threshold (0.05) are considered 'bad' compressions.
  • ...and 7 more figures