
Compressed models are NOT miniature versions of large models

Rohit Raj Rai, Rishant Pal, Amit Awekar

TL;DR

This work challenges the assumption that compressed models are miniature versions of large neural models (LNMs) by conducting a cross-characteristic evaluation of BERT-large against five compressed variants across prediction errors, data representation, data distribution, and adversarial vulnerability. Using SQuAD2 for QA fine-tuning, NewsQA for out-of-distribution (OOD) detection, and IMDB for sentiment analysis attacked with BERT-ATTACK, the authors demonstrate substantial divergence not only between the large model and compressed variants but also among the compressed models themselves. They quantify these differences with metrics like the Jaccard coefficient for error sets, $K$-nearest-neighbor-based data representations, and OOD agreement, revealing low cross-model similarity across all four characteristics. The findings imply that compression can cause nontrivial behavioral shifts, urging caution in deployment and motivating compression techniques that preserve multiple model characteristics beyond accuracy.
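As a rough illustration of the error-set comparison mentioned above, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below assumes each model's errors are collected as a set of example identifiers; the function name `jaccard` and the toy identifiers are hypothetical and not taken from the paper.

```python
def jaccard(errors_a, errors_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two error sets."""
    a, b = set(errors_a), set(errors_b)
    union = a | b
    if not union:
        # Assumption: two models with no errors at all are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical example IDs that each model answered incorrectly.
large_errors = {"q1", "q4", "q7", "q9"}
compressed_errors = {"q4", "q5", "q9", "q12", "q15"}

# 2 shared errors out of 7 distinct errors overall.
print(f"{jaccard(large_errors, compressed_errors):.3f}")  # 0.286
```

A value near 1 would mean the two models fail on almost the same examples; a low value means they fail on largely different examples even if their overall accuracy is similar.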

Abstract

Large neural models are often compressed before deployment. Model compression is necessary for many practical reasons, such as inference latency, memory footprint, and energy consumption. Compressed models are assumed to be miniature versions of the corresponding large neural models. We question this belief in our work. We compare compressed models with their corresponding large neural models on four model characteristics: prediction errors, data representation, data distribution, and vulnerability to adversarial attack. We perform experiments using the BERT-large model and five compressed versions of it. On all four characteristics, the compressed models differ significantly from the BERT-large model. The compressed models also differ from one another on all four characteristics. Apart from the expected loss in model performance, there are major side effects of using compressed models to replace large neural models.

Paper Structure

This paper contains 7 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Variation in data representation agreement [(a) and (b)] and data distribution agreement [(c)] as the value of $K$ changes
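
To make the role of $K$ concrete, the sketch below shows one plausible way to measure nearest-neighbor agreement between two models' representations of the same examples. It assumes agreement is the average fraction of shared $K$ nearest neighbors under Euclidean distance; the paper's exact definition may differ, and the names `knn_sets` and `knn_agreement` as well as the random embeddings are hypothetical.

```python
import math
import random


def knn_sets(embeddings, k):
    """For each point, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, x in enumerate(embeddings):
        # Distance from point i to every other point (the point itself is excluded).
        dists = [(math.dist(x, y), j) for j, y in enumerate(embeddings) if j != i]
        dists.sort()
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def knn_agreement(emb_a, emb_b, k):
    """Average fraction of k nearest neighbors shared by two embedding spaces."""
    nn_a = knn_sets(emb_a, k)
    nn_b = knn_sets(emb_b, k)
    return sum(len(a & b) / k for a, b in zip(nn_a, nn_b)) / len(nn_a)


# Hypothetical embeddings of the same 50 examples from two models.
random.seed(0)
emb_large = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
emb_small = [[v + random.gauss(0, 0.5) for v in row] for row in emb_large]

for k in (1, 5, 10, 20):
    print(k, round(knn_agreement(emb_large, emb_small, k), 3))
```

Under this definition, agreement is 1 when both models place every example among the same neighbors. It also rises mechanically toward 1 as $K$ approaches the dataset size, so values are only comparable at the same $K$.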