Maturity Framework for Enhancing Machine Learning Quality

Angelantonio Castelli, Georgios Christos Chouliaras, Dmitri Goldenberg

TL;DR

This work addresses the challenge of ensuring high-quality, reproducible ML systems through a structured Quality Assessment and Maturity Framework. It defines seven quality characteristics and a quantified scoring approach, supported by an open-source Python package and an ML Registry for automated, scalable evaluation. The framework is validated in a Booking.com deployment, demonstrating measurable improvements in quality and business outcomes, and is presented as a practical pathway to stronger ML governance. The authors argue that this approach can reshape industry standards and be extended to evolving ML paradigms such as GenAI, with ongoing refinements to tooling and domain-specific criteria.

Abstract

With the rapid integration of Machine Learning (ML) into business applications and processes, it is crucial to ensure the quality, reliability, and reproducibility of such systems. We propose a methodical approach to ML system quality assessment and introduce a structured maturity framework for ML governance. We emphasize the importance of quality in ML and the need for rigorous assessment, motivated by issues in ML governance and gaps in existing frameworks. Our primary contribution is a comprehensive, open-source quality assessment method, validated with empirical evidence and accompanied by a systematic maturity framework tailored to ML systems. Drawing on applied experience at Booking.com, we discuss challenges and lessons learned during large-scale adoption within the organization. The study presents empirical findings, highlighting quality improvement trends and showcasing business outcomes. The maturity framework for ML systems aims to become a valuable resource for reshaping industry standards and to enable a structured approach to improving ML maturity in any organization.

Paper Structure

This paper contains 37 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Quality (Y-axis) and maturity (markers) score of a selected subset of systems over iterations.
  • Figure 2: Violin plot of the ML quality score for different months. Each shape represents the score distribution across evaluated models. The white mark indicates the median.
  • Figure 3: Percentage of systems complying with the requirements before and after the framework rollout.
  • Figure 4: Example of a report for a fully mature system. No technical gaps are present, and all the fulfilled quality attributes are listed in green.
  • Figure 5: Example of a report for a system of maturity level 1. The gaps to be fulfilled to pass to the next maturity level are shown in red. The quality attributes to be fulfilled for the subsequent maturity levels are shown in orange. Below each quality attribute the user can see both the motivation of a certain technical gap and a recommendation to remove it.