Model Monitoring in the Absence of Labeled Data via Feature Attributions Distributions
Carlos Mougan
TL;DR
This work develops an unsupervised framework for monitoring deployed ML models when labeled data is unavailable, by leveraging distributions of feature attributions (notably SHAP and LIME) to study both AI alignment and performance monitoring under distribution shift. It formalizes Equal Treatment, arguing that independence between explanation distributions and protected attributes provides a stricter, more philosophically grounded fairness notion than traditional Demographic Parity, and introduces the Equal Treatment Inspector, a classifier-two-sample-test-based tool. The thesis introduces the concept of explanation shifts to capture how model explanations change under distribution shifts and integrates an explainable-uncertainty approach to identify drivers of model deterioration. It delivers open-source software (explanationspace, skshift) and extensive empirical validation on synthetic and real tabular data (e.g., ACS, StackOverflow), demonstrating that explanation-based monitoring can detect shifts and diagnose fairness concerns more sensitively than input- or output-based metrics. The work concludes with reflections on limitations, reliability of explanations, and future directions for extending this framework to broader domains and real-world applications, highlighting the ethical implications of aligning ML systems with liberal and Kantian fairness ideals.
Abstract
Model monitoring involves analyzing AI algorithms once they have been deployed and detecting changes in their behaviour. This thesis explores machine learning (ML) model monitoring before the predictions impact real-world decisions or users. This stage is characterized by one particular condition: the absence of labelled data at test time, which makes it challenging, and often impossible, to calculate performance metrics. The thesis is structured around two main themes: (i) AI alignment, measuring whether AI models behave in a manner consistent with human values, and (ii) performance monitoring, measuring whether the models achieve specific accuracy goals or desiderata. A common methodology unifies all its sections: it explores feature attribution distributions for both monitoring dimensions. The theoretical properties of these feature attribution explanations can be exploited to derive guarantees and insights for model monitoring.
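To make the idea of monitoring through feature attribution distributions concrete, the following is a minimal sketch under stated assumptions, not the thesis's implementation and not the API of explanationspace or skshift. It assumes a linear model with independent features, for which the attribution of feature j can be written in closed form as w_j (x_j - E[x_j]), and it uses a classifier two-sample test to compare attribution distributions between source data and unlabelled new data. The helper names `linear_attributions` and `c2st_auc`, the synthetic data, and the size of the shift are all illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def linear_attributions(model, X, background_mean):
    """Per-feature attributions of a linear model: w_j * (x_j - E[x_j]).

    Assumption: for a linear model with independent features this closed
    form matches the SHAP values; other models need an explainer library.
    """
    return model.coef_ * (X - background_mean)


def c2st_auc(A_source, A_target, seed=0):
    """Classifier two-sample test on two samples of attributions.

    Trains a detector to tell the samples apart and returns its held-out
    AUC. Values near 0.5 mean the samples look alike to the detector.
    """
    Z = np.vstack([A_source, A_target])
    y = np.concatenate([np.zeros(len(A_source)), np.ones(len(A_target))])
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, y, test_size=0.5, random_state=seed, stratify=y
    )
    detector = LogisticRegression().fit(Z_tr, y_tr)
    return roc_auc_score(y_te, detector.predict_proba(Z_te)[:, 1])


rng = np.random.default_rng(0)

# Labelled source data, available only at training time (synthetic).
X_train = rng.normal(size=(2000, 3))
y_train = 2 * X_train[:, 0] + X_train[:, 1] + rng.normal(scale=0.1, size=2000)
model = LinearRegression().fit(X_train, y_train)
background_mean = X_train.mean(axis=0)

# Unlabelled data seen after deployment: one sample from the source
# distribution, one where the first feature has shifted by +2.
X_source = rng.normal(size=(2000, 3))
X_same = rng.normal(size=(2000, 3))
X_shift = rng.normal(size=(2000, 3))
X_shift[:, 0] += 2.0

A_source = linear_attributions(model, X_source, background_mean)
auc_same = c2st_auc(A_source, linear_attributions(model, X_same, background_mean))
auc_shift = c2st_auc(A_source, linear_attributions(model, X_shift, background_mean))

print(f"AUC without shift: {auc_same:.2f}")
print(f"AUC with shift:    {auc_shift:.2f}")
```

No labels for the new data are used anywhere in the comparison, which is the condition the abstract describes. The same detector pattern could in principle be pointed at a protected attribute instead of a source/target indicator, though that variant is not shown here.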
