A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

TL;DR

The paper addresses behavioral gaps that traditional accuracy metrics can overlook when language models of code are compressed via knowledge distillation. It introduces MetaCompress, a metamorphic testing framework that assesses behavioral fidelity between teacher and student using four output-based metamorphic relations (MRs): Prediction Agreement, Probability Distribution Similarity, High Confidence Preservation, and Calibration Alignment, all evaluated under adversarial perturbations. In experiments on CodeBERT and GraphCodeBERT distilled with Compressor, AVATAR, and MORPH on clone detection and vulnerability prediction, the authors show that students often match teacher accuracy yet differ significantly in robustness and predictive behavior, with up to 62% of cases violating the proposed MRs and up to 285% greater performance drops under adversarial attacks. These findings argue for fidelity-aware distillation pipelines and position MetaCompress as a practical tool for diagnosing and guiding behavioral alignment in compressed language models of code.

Abstract

Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.
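As a rough illustration of what output-based relations of this kind might check, the sketch below compares a teacher's and a student's class-probability vectors for the same perturbed input. It is not the authors' implementation: the function names, the use of Jensen-Shannon divergence, the thresholds (`js_tol`, `conf_min`, `gap_tol`), and the confidence-gap stand-in for calibration alignment are all assumptions made for illustration, since this summary does not give the formal definitions of the four relations.

```python
import math


def argmax(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=lambda i: probs[i])


def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def check_relations(teacher_probs, student_probs,
                    js_tol=0.1, conf_min=0.9, gap_tol=0.1):
    """Evaluate four illustrative teacher-vs-student checks on one input.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    t_label, s_label = argmax(teacher_probs), argmax(student_probs)
    t_conf, s_conf = max(teacher_probs), max(student_probs)
    return {
        # Same predicted label.
        "prediction_agreement": t_label == s_label,
        # Output distributions are close (assumed metric: JS divergence).
        "distribution_similarity":
            js_divergence(teacher_probs, student_probs) <= js_tol,
        # If the teacher is highly confident, the student should be too.
        "high_confidence_preserved": t_conf < conf_min or s_conf >= conf_min,
        # Crude stand-in for calibration alignment: similar top confidence.
        "confidence_gap_ok": abs(t_conf - s_conf) <= gap_tol,
    }


def violation_rate(pairs, relation, **thresholds):
    """Fraction of (teacher_probs, student_probs) pairs violating a relation."""
    violations = sum(
        not check_relations(t, s, **thresholds)[relation] for t, s in pairs
    )
    return violations / len(pairs)
```

Calling `violation_rate` on a list of teacher and student output pairs gives a per-relation violation percentage, which is the kind of quantity the reported "up to 62%" figure summarizes. A faithful calibration check would also need ground-truth labels (for example, to compare expected calibration error), which this sketch deliberately omits.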

Paper Structure

This paper contains 33 sections, 15 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Quality analysis of adversarial examples across various models and evaluation metrics.
  • Figure 2: Attack success rate metric values across adversarial attack techniques, models, tasks, and knowledge distillation methods.
  • Figure 3: Workflow of the MetaCompress framework, comparing teacher and student model outputs under behavior-preserving metamorphic relations to assess behavioral fidelity.
  • Figure 4: Behavioral fidelity discrepancies between the outputs of teacher and student models.
  • Figure 5: MR1 violation rates for different tasks, models, and knowledge distillation techniques.
  • ...and 2 more figures