Tracing Distribution Shifts with Causal System Maps

Joran Leest, Ilias Gerostathopoulos, Patricia Lago, Claudia Raibulet

TL;DR

The paper tackles the problem of attributing distribution shifts in ML systems to their causes rather than merely detecting that a shift has occurred. It proposes ML System Maps, a hierarchical, causal, view-based representation that explicitly models the environment and internal subsystems to trace shifts. The approach defines a notation and three coordinated views, and develops AQ1–AQ3 attribution patterns to route, localize, and externalize causes, with potential use of Shapley-based attribution. If validated, ML System Maps could improve incident response, cross-team collaboration, and the reliability of ML deployments by providing structured, end-to-end shift explanation.

Abstract

Monitoring machine learning (ML) systems is hard, with standard practice focusing on detecting distribution shifts rather than their causes. Root-cause analysis often relies on manual tracing to determine whether a shift is caused by software faults, data-quality issues, or natural change. We propose ML System Maps -- causal maps that, through layered views, make explicit the propagation paths between the environment and the ML system's internals, enabling systematic attribution of distribution shifts. We outline the approach and a research agenda for its development and evaluation.

Paper Structure

This paper contains 18 sections and 2 figures.

Figures (2)

  • Figure 1: Illustrative ML System Map composed of five views: Environment, ML System, and three Subsystem views (data pipeline for activity features, model serving for churn, application for outreach). Nodes represent variables: data (D), modulators (M), and random (R). Edges denote relations: causal (solid arrows), mapping (dashed lines), and causal mapping (dashed arrows).
  • Figure 2: Illustrative scenarios showing how a shift in the promotion rankings is traced by sequentially traversing views (attribution questions AQ1–AQ3) via dashed mapping lines (same quantity across views). Branching paths show scenarios based on where mass concentrates (red outlines), each an attribution pattern (AP). Unlabeled nodes show variables around the implicated variable.
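
The notation in the captions above (typed nodes, three edge kinds, views linked by mapping lines) can be illustrated with a small data structure. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names, the `trace_upstream` helper, and the example variables (`churn_score`, `activity_features`, `customer_activity`) are hypothetical, and treating a causal-mapping edge as a directed edge that may cross views is an interpretation of the caption rather than something the source defines. The sketch only enumerates candidate upstream variables; it does not compute any quantitative (for example Shapley-based) attribution.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    DATA = "D"
    MODULATOR = "M"
    RANDOM = "R"


class EdgeKind(Enum):
    CAUSAL = "causal"                  # solid arrow
    MAPPING = "mapping"                # dashed line: same quantity across views
    CAUSAL_MAPPING = "causal_mapping"  # dashed arrow (assumed directed, cross-view)


@dataclass(frozen=True)
class Node:
    view: str
    name: str
    kind: NodeKind


@dataclass
class SystemMap:
    edges: list = field(default_factory=list)  # (source, target, EdgeKind) triples

    def add_edge(self, source, target, kind):
        self.edges.append((source, target, kind))

    def parents(self, node):
        """Sources of directed (causal or causal-mapping) edges into `node`."""
        directed = (EdgeKind.CAUSAL, EdgeKind.CAUSAL_MAPPING)
        return [s for s, t, k in self.edges if t == node and k in directed]

    def counterparts(self, node):
        """Nodes joined to `node` by an undirected mapping edge."""
        out = []
        for s, t, k in self.edges:
            if k is EdgeKind.MAPPING:
                if s == node:
                    out.append(t)
                elif t == node:
                    out.append(s)
        return out


def trace_upstream(system_map, start):
    """Breadth-first walk: causal edges backwards, mapping edges either way.

    Returns the visited nodes (excluding `start`) in visit order.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in system_map.parents(node) + system_map.counterparts(node):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical example loosely following the churn/outreach setting of Figure 1.
env_activity = Node("Environment", "customer_activity", NodeKind.DATA)
ml_activity = Node("ML System", "activity_features", NodeKind.DATA)
ml_churn = Node("ML System", "churn_score", NodeKind.DATA)
ml_rank = Node("ML System", "promotion_rankings", NodeKind.DATA)
app_rank = Node("Subsystem: application", "promotion_rankings", NodeKind.DATA)

system_map = SystemMap()
system_map.add_edge(env_activity, ml_activity, EdgeKind.CAUSAL_MAPPING)
system_map.add_edge(ml_activity, ml_churn, EdgeKind.CAUSAL)
system_map.add_edge(ml_churn, ml_rank, EdgeKind.CAUSAL)
system_map.add_edge(ml_rank, app_rank, EdgeKind.MAPPING)

candidates = trace_upstream(system_map, app_rank)
for node in candidates:
    print(f"{node.view}: {node.name}")
```

Starting from the shifted `promotion_rankings` variable in the subsystem view, the walk first crosses the mapping line into the ML System view and then follows causal edges back toward the environment, which mirrors the view-by-view traversal described for Figure 2. Deciding which of these candidates is actually responsible would require the attribution step that this sketch leaves out.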