LabelAId: Just-in-time AI Interventions for Improving Human Labeling Quality and Domain Knowledge in Crowdsourcing Systems

Chu Li; Zhihan Zhang; Michael Saugstad; Esteban Safranchik; Minchu Kulkarni; Xiaoyu Huang; Shwetak Patel; Vikram Iyer; Tim Althoff; Jon E. Froehlich

TL;DR

This paper introduces LabelAId, an advanced inference model that combines Programmatic Weak Supervision (PWS) with FT-Transformers to infer label correctness from user behavior and domain knowledge, and integrates it into Project Sidewalk, an open-source crowdsourcing platform for urban accessibility.

Abstract

Crowdsourcing platforms have transformed distributed problem-solving, yet quality control remains a persistent challenge. Traditional quality control measures, such as prescreening workers and refining instructions, often focus solely on optimizing economic output. This paper explores just-in-time AI interventions to enhance both labeling quality and domain-specific knowledge among crowdworkers. We introduce LabelAId, an advanced inference model combining Programmatic Weak Supervision (PWS) with FT-Transformers to infer label correctness based on user behavior and domain knowledge. Our technical evaluation shows that our LabelAId pipeline consistently outperforms state-of-the-art ML baselines, improving mistake inference accuracy by 36.7% with 50 downstream samples. We then integrated LabelAId into Project Sidewalk, an open-source crowdsourcing platform for urban accessibility. A between-subjects study with 34 participants demonstrates that LabelAId significantly enhances label precision without compromising efficiency while also increasing labeler confidence. We discuss LabelAId's success factors, limitations, and its generalizability to other crowdsourced science domains.
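
As a rough illustration of the programmatic weak supervision step named in the abstract, the sketch below applies a few heuristic labeling functions to a crowdsourced label and combines their votes. Everything in it is an assumption made for illustration: the feature names (`dist_to_intersection_m`, `zoom`, `n_tags`), the thresholds, and the three labeling functions are hypothetical and are not taken from the paper. The votes are combined with a plain majority vote, a simplification chosen for brevity rather than the label model used in the LabelAId pipeline.

```python
from typing import Any, Callable, Dict, List

# Vote values a labeling function may emit.
CORRECT, WRONG, ABSTAIN = 1, 0, -1

Record = Dict[str, Any]  # hypothetical feature record for one crowdsourced label
LabelingFunction = Callable[[Record], int]


def lf_far_from_intersection(x: Record) -> int:
    # Hypothetical heuristic: a Curb Ramp label far from any intersection is suspicious.
    if x["label_type"] == "CurbRamp" and x["dist_to_intersection_m"] > 30:
        return WRONG
    return ABSTAIN


def lf_low_zoom(x: Record) -> int:
    # Hypothetical heuristic: labels placed without zooming in are more error-prone.
    return WRONG if x["zoom"] <= 1 else ABSTAIN


def lf_has_tags(x: Record) -> int:
    # Hypothetical heuristic: labelers who add descriptive tags tend to be careful.
    return CORRECT if x["n_tags"] >= 1 else ABSTAIN


def majority_vote(votes: List[int]) -> int:
    """Combine votes, ignoring abstentions; ties and all-abstain yield ABSTAIN."""
    n_correct = sum(1 for v in votes if v == CORRECT)
    n_wrong = sum(1 for v in votes if v == WRONG)
    if n_correct == n_wrong:
        return ABSTAIN
    return CORRECT if n_correct > n_wrong else WRONG


def weak_label(x: Record, lfs: List[LabelingFunction]) -> int:
    """Apply every labeling function to one record and combine the votes."""
    return majority_vote([lf(x) for lf in lfs])


LFS: List[LabelingFunction] = [lf_far_from_intersection, lf_low_zoom, lf_has_tags]

suspicious = {"label_type": "CurbRamp", "dist_to_intersection_m": 50.0, "zoom": 1, "n_tags": 0}
careful = {"label_type": "CurbRamp", "dist_to_intersection_m": 5.0, "zoom": 3, "n_tags": 2}
print(weak_label(suspicious, LFS), weak_label(careful, LFS))
```

Records on which the functions abstain or tie receive no weak label in this sketch. The weakly labeled records would then serve as the noisy pre-training set, and a small number of expert-validated labels would be used for fine-tuning.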

Paper Structure (41 sections, 1 equation, 10 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Improving Quality of Crowdsourced Labels
Teachable Moments in Crowdsourcing for Community Science
Machine Learning to Infer Label Correctness
LabelAId: A Label Correctness Inference Framework
LabelAId Pipeline
Programmatic Weak Supervision (PWS)
Transfer Learning from AIA to Expert-Validated Labels
Applying LabelAId to Project Sidewalk
Dataset Description
Input Features & Labeling Functions
Multi-city Pre-training
Fine-tuning on a Specific City
Technical Evaluation
...and 26 more sections

Figures (10)

Figure 1: An overview of our LabelAId pipeline. Programmatic weak supervision, utilizing domain-specific knowledge and heuristics, is employed to annotate the raw data. Subsequently, the automatically generated, imperfectly annotated data from PWS are used to pre-train the inference model. Lastly, the inference model is fine-tuned using expert-validated labels for the target downstream task. Diagram adapted from ratner_snorkel_2017.
Figure 2: (A) Project Sidewalk Labeling Interface. (B) Project Sidewalk Label Types. (C) Examples of Project Sidewalk severity ratings for surface problems. Severity 5 is the most severe, indicating a scenario impassable by wheelchair users.
Figure 3: Conceptual diagram of our FT-Transformer-based model architecture. First, the model transforms the hybrid features (e.g., two numerical and two categorical features) into unified embeddings. Subsequently, these embeddings are processed iteratively by the Transformer layer. The final output is based on the [CLS] token. Diagram adapted from gorishniy_revisiting_2021.
Figure 4: Overall performance of our LabelAId pipeline compared to the traditional ML methods as the number of expert-validated downstream labels increases. Note that the x-axis is on a log scale (N = 3, error bar = $\pm\sigma$).
Figure 5: Selected typical inference false positives per label type (the actual label is wrong but was inferred as correct). (a, c) Failed to differentiate between a Curb Ramp and a Missing Curb Ramp. (b) Labeled a drainage swale near an intersection as Curb Ramp. (d) Labeled Missing Curb Ramp where there is no sidewalk. (e-j) Label has attributes for a correct label but there is ample space for a wheelchair user to pass.
...and 5 more figures
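
The first step in the Figure 3 caption, turning mixed numerical and categorical features into unified embeddings with a [CLS] token, can be sketched as follows. This is a minimal pure-Python sketch of a feature tokenizer in the style of FT-Transformer, not the authors' implementation: the class name `FeatureTokenizer`, the embedding size, the random initialization, and the choice to place [CLS] first are all assumptions. Each numerical feature is mapped to `x * w + b`, and each categorical feature to a looked-up embedding row plus a bias.

```python
import random
from typing import List, Sequence

Vector = List[float]


class FeatureTokenizer:
    """Hypothetical tokenizer: one d-dimensional embedding per feature, plus [CLS]."""

    def __init__(self, n_num: int, cat_cardinalities: Sequence[int], d: int, seed: int = 0):
        rng = random.Random(seed)

        def vec() -> Vector:
            # Small random values stand in for learned parameters.
            return [rng.uniform(-0.1, 0.1) for _ in range(d)]

        self.d = d
        self.num_weight = [vec() for _ in range(n_num)]  # one weight vector per numerical feature
        self.num_bias = [vec() for _ in range(n_num)]
        self.cat_tables = [[vec() for _ in range(c)] for c in cat_cardinalities]  # lookup tables
        self.cat_bias = [vec() for _ in cat_cardinalities]
        self.cls = vec()  # embedding of the [CLS] token

    def __call__(self, x_num: Sequence[float], x_cat: Sequence[int]) -> List[Vector]:
        tokens = [list(self.cls)]  # [CLS] placed first here; the position is a choice of this sketch
        for x, w, b in zip(x_num, self.num_weight, self.num_bias):
            # Numerical feature: scale its weight vector by the scalar value, then add the bias.
            tokens.append([x * wi + bi for wi, bi in zip(w, b)])
        for idx, table, b in zip(x_cat, self.cat_tables, self.cat_bias):
            # Categorical feature: look up the row for this category, then add the bias.
            tokens.append([ei + bi for ei, bi in zip(table[idx], b)])
        return tokens


# Two numerical and two categorical features, as in the caption's example.
tokenizer = FeatureTokenizer(n_num=2, cat_cardinalities=[3, 4], d=8)
tokens = tokenizer([0.5, -1.2], [2, 0])
print(len(tokens), len(tokens[0]))
```

The resulting sequence of five token embeddings would then be passed through the Transformer layers, with the prediction read from the final [CLS] representation.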
