What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing

Chenyang Yang; Yining Hong; Grace A. Lewis; Tongshuang Wu; Christian Kästner

TL;DR

This work proposes SemSlicer, a framework for semantic data slicing: identifying semantically coherent slices without requiring existing features. The authors show that SemSlicer generates accurate slices at low cost, allows flexible trade-offs between design dimensions, reliably identifies under-performing data slices, and helps practitioners find useful slices that reflect systematic problems.

Abstract

Machine learning models make mistakes, yet sometimes it is difficult to identify the systematic problems behind the mistakes. Practitioners engage in various activities, including error analysis, testing, auditing, and red-teaming, to form hypotheses of what can go (or has gone) wrong with their models. To validate these hypotheses, practitioners employ data slicing to identify relevant examples. However, traditional data slicing is limited by available features and programmatic slicing functions. In this work, we propose SemSlicer, a framework that supports semantic data slicing, which identifies a semantically coherent slice, without the need for existing features. SemSlicer uses Large Language Models to annotate datasets and generate slices from any user-defined slicing criteria. We show that SemSlicer generates accurate slices with low cost, allows flexible trade-offs between different design dimensions, reliably identifies under-performing data slices, and helps practitioners identify useful data slices that reflect systematic problems.
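To make the "validate" step concrete, the following is a minimal sketch of how a slice can be used to check a hypothesis: apply a slicing function to an evaluation set and compare model accuracy inside and outside the slice. The helper name `slice_accuracy`, the regex, and the calling convention are illustrative assumptions, not part of SemSlicer; the regex mirrors the kind of programmatic slicing function the abstract describes as limited.

```python
import re
from typing import Callable, Sequence, Tuple


def slice_accuracy(
    texts: Sequence[str],
    labels: Sequence[int],
    preds: Sequence[int],
    in_slice: Callable[[str], bool],
) -> Tuple[float, float, int]:
    """Return (accuracy inside slice, accuracy outside slice, slice size).

    `in_slice` is any slicing function: a regex match, a feature test,
    or a wrapper around an LLM annotation.
    """
    hits_in = n_in = hits_out = n_out = 0
    for text, y, y_hat in zip(texts, labels, preds):
        correct = int(y == y_hat)
        if in_slice(text):
            n_in += 1
            hits_in += correct
        else:
            n_out += 1
            hits_out += correct
    acc_in = hits_in / n_in if n_in else float("nan")
    acc_out = hits_out / n_out if n_out else float("nan")
    return acc_in, acc_out, n_in


# Hypothetical programmatic slicing function: keyword match only.
# It misses comments that discuss the topic without these exact words,
# which is the gap semantic slicing is meant to close.
_PATTERN = re.compile(r"muslim|islam", re.IGNORECASE)


def mentions_keywords(text: str) -> bool:
    return bool(_PATTERN.search(text))
```

A noticeably lower accuracy inside the slice than outside it would support the hypothesis that the model has a systematic problem on that slice; a similar accuracy would argue against it.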

Paper Structure (53 sections, 5 figures, 4 tables)

Introduction
Data Slicing
Data slicing in ML engineering
Status quo and limitations
From crowdsourcing to automated semantic slicing
Semantic Data Slicing
Design Dimensions
Slicing accuracy needed
Slicing latency expected
Human effort available
Computational resources available
Design trade-off
System Design
System overview
Slicing function components
...and 38 more sections

Figures (5)

Figure 1: ML model quality assurance involves two stages: (1) Hypothesize and (2) Validate. Many activities focus on creating hypotheses, either explicitly (requirements analysis, error analysis) or implicitly in the process (testing, auditing, red-teaming). Data slicing helps validate the produced hypotheses by identifying additional relevant examples, often from evaluation and production data.
Figure 2: Existing practices mostly apply programmatic data slicing (cf. Table \ref{tab:zeno}). For our running example, we can use a simple regex to detect comments with the phrase "muslim" or "islam" (line \ref{code:programmatic_slicing}). In contrast, SemSlicer supports semantic data slicing by using an LLM and generated prompts as slicing functions for any user-provided criteria (line \ref{code:semantic_slicing}). The examples here are from the CivilComments dataset [Borkan2019NuancedMF]; they do not represent the authors' view.
Figure 3: SemSlicer's workflow: The user first specifies a slicing criterion (keywords, descriptions, etc.) and provides a dataset to slice. SemSlicer will ➀ construct and ➁ refine a classification instruction from the slicing criterion, optionally with a human in the loop. SemSlicer will then ➂ sample and ➃ label few-shot examples, with ➄ synthetic examples generated if needed. Finally, SemSlicer uses the produced prompt to annotate the dataset and create the slices.
Figure 4: Prompt templates used in SemSlicer.
Figure 5: We visualize the F1 score and cost of each configuration (for CivilComments, n=6k), which shows a clear trend of two trade-offs: whether to use few-shot examples (higher accuracy with higher compute), and whether to have a human in the loop (higher accuracy with more human effort).
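The workflow in the Figure 3 caption (build a classification instruction, attach few-shot examples, then annotate each item with the resulting prompt) can be sketched roughly as follows. This is an illustration under stated assumptions, not SemSlicer's implementation: the prompt wording is invented here and is not one of the templates from Figure 4, `build_prompt` and `semantic_slice` are hypothetical names, and `llm` is any caller-supplied function that maps a prompt string to a text answer. A keyword-based stand-in, `fake_llm`, is used so the sketch runs without a model.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(criterion: str, few_shot: Sequence[Tuple[str, bool]], text: str) -> str:
    """Assemble an instruction, optional labeled examples, and the item to annotate."""
    parts: List[str] = [
        f"Does the following text relate to: {criterion}? Answer yes or no."
    ]
    for example, label in few_shot:
        parts.append(f"Text: {example}\nAnswer: {'yes' if label else 'no'}")
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


def semantic_slice(
    texts: Sequence[str],
    criterion: str,
    llm: Callable[[str], str],
    few_shot: Sequence[Tuple[str, bool]] = (),
) -> List[str]:
    """Return the texts that the annotator marks as matching the criterion."""
    selected = []
    for text in texts:
        answer = llm(build_prompt(criterion, few_shot, text))
        if answer.strip().lower().startswith("yes"):
            selected.append(text)
    return selected


def fake_llm(prompt: str) -> str:
    """Stand-in annotator (assumption): inspects only the final item in the prompt."""
    target = prompt.rsplit("Text: ", 1)[1].lower()
    return "yes" if ("mosque" in target or "ramadan" in target) else "no"
```

Passing the annotator in as a plain callable keeps the sketch independent of any particular model API; whether to supply few-shot examples is the compute-versus-accuracy trade-off that the Figure 5 caption describes.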
