An Empirical Framework for Evaluating Semantic Preservation Using Hugging Face

Nan Jia, Anita Raja, Raffi Khatchadourian

TL;DR

The paper tackles the challenge of ensuring semantic preservation as ML components evolve within learning-enabled software systems (LESS). It presents an empirical framework that mines Hugging Face model evolution, using Model Cards and commit histories to measure metric stability as a proxy for semantic preservation. A large-scale pipeline analyzes 536 models and 4,297 metrics, supplemented by case studies in image, tabular, and reinforcement learning tasks that illustrate drift and preservation patterns. The work provides a baseline for trustworthy ML maintenance, emphasizing automated metric extraction, intra-repository signals, and the critical role of documentation quality. The findings offer practical insights for defining stability thresholds and guiding maintainability efforts in evolving ML systems.

Abstract

As machine learning (ML) becomes an integral part of high-autonomy systems, it is critical to ensure the trustworthiness of learning-enabled software systems (LESS). Yet, the nondeterministic and run-time-defined semantics of ML complicate traditional software refactoring. We define semantic preservation in LESS as the property that optimizations of intelligent components do not alter the system's overall functional behavior. This paper introduces an empirical framework to evaluate semantic preservation in LESS by mining model evolution data from Hugging Face. We extract commit histories, Model Cards, and performance metrics from a large number of models. To establish baselines, we conducted case studies in three domains, tracing performance changes across versions. Our analysis demonstrates how semantic drift can be detected via evaluation metrics across commits and reveals common refactoring patterns based on commit message analysis. Although API constraints limited the possibility of estimating a full-scale threshold, our pipeline offers a foundation for defining community-accepted boundaries for semantic preservation. Our contributions include: (1) a large-scale dataset of ML model evolution, curated from 1.7 million Hugging Face entries via a reproducible pipeline using the native HF hub API, (2) a practical pipeline for the evaluation of semantic preservation for a subset of 536 models and 4000+ metrics, and (3) empirical case studies illustrating semantic drift in practice. Together, these contributions advance the foundations for more maintainable and trustworthy ML systems.
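
The commit message analysis mentioned in the abstract can be illustrated with a simple keyword tally. The following is a minimal sketch, not the authors' pipeline: the function name `tally_keywords`, the prefix-matching rule, and the sample messages are all hypothetical, and the four keywords are only those named in the Figure 2 caption (the paper's actual keyword set may differ).

```python
import re
from collections import Counter

# Assumption: only the four terms named in the Figure 2 caption are used here;
# the paper's real keyword list is not given in this summary.
KEYWORDS = ("update", "refactor", "optimized", "security")


def tally_keywords(messages, keywords=KEYWORDS):
    """Count how many commit messages mention each keyword (case-insensitive)."""
    counts = Counter({k: 0 for k in keywords})
    for msg in messages:
        # Split the message into lowercase alphabetic words.
        words = set(re.findall(r"[a-z]+", msg.lower()))
        for k in keywords:
            # Prefix match, so "updated" counts toward "update".
            # Each message contributes at most once per keyword.
            if any(w.startswith(k) for w in words):
                counts[k] += 1
    return counts


# Hypothetical commit messages, invented for illustration only.
sample = [
    "Update README.md",
    "update model card",
    "Refactor tokenizer config",
    "Upload model weights",
    "Updated metrics after security patch",
]

if __name__ == "__main__":
    for keyword, n in tally_keywords(sample).most_common():
        print(f"{keyword}: {n}")
```

Counting messages rather than raw word occurrences keeps a single verbose commit from dominating the distribution; whether the paper counts this way is not stated here.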

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: End-to-end pipeline for metadata filtering, metric extraction, and semantic drift detection.
  • Figure 2: Refactoring patterns in the sampled data: a log-scale chart showing the distribution of top refactoring-related keywords in sampled HF model repositories. The vast majority of relevant commit messages use the term “update”, while explicit mentions of “refactor”, “optimized”, or “security” are rare, highlighting an imbalance in how NF changes are documented.
  • Figure 3: “Acceptable” change in Accuracy and Training Loss. (a) and (b) illustrate the boundaries of “acceptable” change in accuracy and training loss in a classification task. These plots show that the majority of models exhibit semantic preservation (within $\pm0.15$ for accuracy or $\pm0.13$ for training loss), which aligns with our statistical findings. The CIs include zero, which suggests that the semantics are preserved and that the observed changes are not statistically significant.
  • Figure 4: Semantic Drifts in Different Tasks. (a) shows the accuracy of an image classification model. The initial accuracy of the pre-trained model on the target domain was 18%. After fine-tuning with new data, the accuracy improved to 58.9%. (b) displays the accuracy of a tabular data classification model. The model's performance fluctuated, starting at 85.5% on 11/5/2024 and decreasing to 84.8% on 11/25/2024. (c) presents the mean reward for a reinforcement learning task. The model's performance improved significantly, with a total gain of $31.1\pm 7.88$. The shaded areas ($\Delta$ accuracy) indicate the uncertainty for each metric's value, reflecting intentional, non-breaking refactorings for task-specific optimization. Accuracy is used as the primary metric for classification tasks due to space limitations. Precision, recall, and F1-score followed similar trends.
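
The thresholds and accuracy values quoted in the Figure 3 and Figure 4 captions can be combined into a small worked check. This is a rough sketch under stated assumptions, not the paper's method: it assumes the $\pm0.15$ accuracy band is expressed as a fraction on a 0 to 1 scale, the names `classify_change` and `mean_ci` are hypothetical, and the list of per-model deltas used for the confidence interval is invented for illustration.

```python
import math
import statistics

# Assumption: the +/-0.15 accuracy band from the Figure 3 caption is a
# fraction on a 0-1 scale (i.e. 15 percentage points).
ACCURACY_TOLERANCE = 0.15


def classify_change(before, after, tolerance=ACCURACY_TOLERANCE):
    """Return the metric change and whether it stays inside the band."""
    delta = after - before
    label = "within_band" if abs(delta) <= tolerance else "outside_band"
    return delta, label


def mean_ci(deltas, z=1.96):
    """Normal-approximation 95% CI for the mean change (crude for small n)."""
    mean = statistics.mean(deltas)
    half_width = z * statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Accuracy values taken from the Figure 4 caption, as fractions.
    print("image:", classify_change(0.18, 0.589))     # 18% -> 58.9%
    print("tabular:", classify_change(0.855, 0.848))  # 85.5% -> 84.8%

    # Hypothetical per-model accuracy deltas, invented for illustration.
    hypothetical_deltas = [0.02, -0.01, 0.03, -0.04, 0.01]
    low, high = mean_ci(hypothetical_deltas)
    print("CI includes zero:", low < 0 < high)
```

Under these assumptions, the image classification change of about 0.409 falls outside the band, while the tabular change of about -0.007 stays within it. A CI that straddles zero, as in the hypothetical data, corresponds to the caption's reading that the observed changes are not statistically significant.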