Human-in-the-Loop Synthetic Text Data Inspection with Provenance Tracking

Hong Jin Kang, Fabrice Harel-Canada, Muhammad Ali Gulzar, Violet Peng, Miryung Kim

TL;DR

This paper addresses the challenge of filtering low-quality texts produced by NLP data augmentation, where labels may be incorrect or texts garbled. It introduces INSPECTOR, a human-in-the-loop system that combines provenance tracking (transformation provenance and feature provenance) with assistive labeling (quality metrics and LLM predictions) to streamline data inspection. In a within-subject study with 15 participants on SST2 and TweetEval, participants using INSPECTOR identified 3x to 4x more texts with correct labels and reported higher confidence. INSPECTOR also improved model robustness on adversarial attacks by up to 32%. The findings highlight the value of combining provenance-based grouping with assistive labeling, although grouping by linguistic feature provenance was perceived as less helpful. INSPECTOR is open source.

Abstract

Data augmentation techniques apply transformations to existing texts to generate additional data. The transformations may produce low-quality texts, where the meaning of the text is changed and the text may even be mangled beyond human comprehension. Analyzing the synthetically generated texts and their corresponding labels is slow and demanding. To winnow out texts with incorrect labels, we develop INSPECTOR, a human-in-the-loop data inspection technique. INSPECTOR combines the strengths of provenance tracking techniques with assistive labeling. INSPECTOR allows users to group related texts by their transformation provenance, i.e., the transformations applied to the original text, or feature provenance, i.e., the linguistic features of the original text. For assistive labeling, INSPECTOR computes metrics that approximate data quality, and allows users to compare the corresponding label of each text against the predictions of a large language model. In a user study, INSPECTOR increases the number of texts with correct labels identified by 3x on a sentiment analysis task and by 4x on a hate speech detection task. The participants found grouping the synthetically generated texts by their common transformation to be the most useful technique. Surprisingly, grouping texts by common linguistic features was perceived to be unhelpful. Contrary to prior work, our study finds that no single technique obviates the need for human inspection effort. This validates the design of INSPECTOR, which combines both analysis of data provenance and assistive labeling to reduce human inspection effort.
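
The two ideas in the abstract, grouping by transformation provenance and comparing labels against a model prediction, can be sketched roughly as follows. This is not INSPECTOR's implementation. The record layout, the field names (`transforms`, `llm_label`), the transformation names, and the sample records are all assumptions made up for illustration, and the `llm_label` values are hard-coded rather than produced by a real model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SyntheticText:
    text: str
    label: str                        # label inherited from the original text
    transforms: Tuple[str, ...]       # transformation provenance (hypothetical field)
    llm_label: Optional[str] = None   # prediction from some LLM (hypothetical field)


def group_by_transform(records):
    """Map each transformation name to the records it was applied to."""
    groups = defaultdict(list)
    for rec in records:
        for name in rec.transforms:
            groups[name].append(rec)
    return dict(groups)


def label_disagreements(records):
    """Return records whose assigned label differs from the LLM prediction."""
    return [
        r for r in records
        if r.llm_label is not None and r.llm_label != r.label
    ]


# Hypothetical records; "+" and "-" stand for positive and negative sentiment.
records = [
    SyntheticText("up being surprising", "-", ("WordDeletion",), llm_label="+"),
    SyntheticText("ends up being dull", "-", ("WordDeletion",), llm_label="-"),
    SyntheticText("the event is beautiful to see", "+",
                  ("SynonymSubstitution",), llm_label="+"),
]

groups = group_by_transform(records)
suspicious = label_disagreements(records)
```

A disagreement only flags a text for closer inspection. Consistent with the study's finding that no single technique removes the need for human effort, the sketch does not treat the model prediction as ground truth.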

Paper Structure (22 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Examples of transformed texts from the SST2 movie review dataset generated during data augmentation. A transformed text can contain garbled text, or have an inappropriate label. As an example, the "Word Deletion" transformation can mangle the text "ends up being surprisingly dull" into "up being surprising", causing its corresponding label "-" (indicating a negative sentiment) to no longer be appropriate. Of the four examples of synthetically generated texts, only one ("the event is beautiful to see") has an appropriate label.
  • Figure 2: INSPECTOR. The user alternates between (1) inspecting the provenance of groups of texts and labels by their (A) common transformation and (B) common linguistic features, and (2) inspecting individual transformed texts with their corresponding labels, with assistive labeling using (C) the quality metrics (alignment, grammaticality, and fluency scores) and (D) LLM predictions.
  • Figure 3: Transform provenance. A user selects texts and inspects the common transforms (e.g., RandomCharSubset) in the transformation provenance pane, where they can (A) see inspection statistics (e.g., the user has inspected 14 texts, with 11 marked as high quality) and (B) view other texts sharing the same transform. A user can then (C) mark all instances sharing the same transform as correct, obviating the need to inspect individual texts one by one.
  • Figure 4: Feature provenance. A user can select texts, and can inspect linguistic features common to the selected texts (e.g., "Has a description of a location") in the transformation provenance pane with their inspection statistics (e.g., the user has inspected 24 texts, with 11 marked as high quality). Then, a user can mark all instances sharing the same feature to be correct.
  • Figure 5: The workflow of a user inspecting data using INSPECTOR.
  • ...and 3 more figures
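
The inspection statistics and bulk-marking step described in the Figure 3 caption might be modeled along these lines. This is a minimal sketch under assumptions, not the tool's code: the `Item` class, the `verdict` values, and both helper functions are hypothetical. The choice to leave already-inspected texts untouched when bulk-marking is also an assumption, since the caption does not say how earlier verdicts are handled.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class Item:
    text: str
    transforms: FrozenSet[str]
    verdict: str = "unseen"   # hypothetical states: "unseen", "high", "low"


def transform_stats(items, name):
    """Count total, inspected, and high-quality items sharing a transform."""
    shared = [it for it in items if name in it.transforms]
    inspected = [it for it in shared if it.verdict != "unseen"]
    high = [it for it in inspected if it.verdict == "high"]
    return {"total": len(shared), "inspected": len(inspected), "high": len(high)}


def mark_all(items, name, verdict="high"):
    """Bulk-mark every uninspected item sharing the transform; return the count."""
    changed = 0
    for it in items:
        if name in it.transforms and it.verdict == "unseen":
            it.verdict = verdict
            changed += 1
    return changed


# Hypothetical data: four texts share one transform, one text has another.
items = [Item(f"sample {i}", frozenset({"RandomCharSubset"})) for i in range(4)]
items.append(Item("other", frozenset({"WordDeletion"})))
items[0].verdict = "high"   # the user inspected two texts by hand
items[1].verdict = "low"

before = transform_stats(items, "RandomCharSubset")
changed = mark_all(items, "RandomCharSubset")
after = transform_stats(items, "RandomCharSubset")
```

Keeping manual verdicts intact means a text the user already rejected is not silently flipped by a later bulk action on its group.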