
Exploring Empty Spaces: Human-in-the-Loop Data Augmentation

Catherine Yeh, Donghao Ren, Yannick Assogba, Dominik Moritz, Fred Hohman

TL;DR

Amplio tackles the challenge of augmenting unstructured text by identifying under-explored data regions in embedding space and filling them through three human-in-the-loop methods: Augment with Concepts, Augment by Interpolation, and Augment with LLM. The approach blends embedding inversion, SAE-derived concepts, and guided prompting to provide controllable, interpretable augmentation while maintaining data quality. A formative Apple study informed design goals, and a user study with 18 red teamers demonstrated Amplio’s ability to generate diverse, relevant safety prompts and to reveal distinct use cases for each method. The work suggests practical benefits for improving dataset diversity, offers insights into human-in-the-loop design and visualization-assisted augmentation, and outlines pathways for integrating such tools into real-world workflows for safer, more robust models.

Abstract

Data augmentation is crucial for making machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.
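
To make the idea of "empty data spaces" concrete, here is a minimal sketch that assumes sentences have already been projected to 2D points and that emptiness is judged by a uniform grid. The function name `find_empty_cells`, the `grid_size` parameter, and the grid-occupancy approach itself are illustrative assumptions, not the method Amplio is stated to use.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def find_empty_cells(points: List[Point], grid_size: int = 4) -> List[Tuple[int, int]]:
    """Return (row, col) cells of the points' bounding box that contain no points.

    Hypothetical helper: a uniform grid is only one possible way to
    approximate under-explored regions of a 2D embedding projection.
    """
    if not points:
        return []

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # Fall back to a span of 1.0 when all points share a coordinate,
    # which avoids division by zero.
    span_x = (max_x - min_x) or 1.0
    span_y = (max_y - min_y) or 1.0

    occupied = set()
    for x, y in points:
        # Clamp so that points on the upper boundary land in the last cell.
        col = min(int((x - min_x) / span_x * grid_size), grid_size - 1)
        row = min(int((y - min_y) / span_y * grid_size), grid_size - 1)
        occupied.add((row, col))

    return [
        (r, c)
        for r in range(grid_size)
        for c in range(grid_size)
        if (r, c) not in occupied
    ]


# Toy data: two clusters in opposite corners leave the other two corners empty.
points = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.2), (0.9, 0.8)]
print(find_empty_cells(points, grid_size=2))
```

In a sketch like this, each empty cell would be a candidate region for a practitioner to inspect and decide whether new sentences there would be relevant.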

Paper Structure (68 sections, 2 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Our system, Amplio, aims to provide a middle ground between freeform and structured text augmentation.
  • Figure 2: With our interface, ML practitioners can quickly get an overview of their dataset in three ways. (A) First, users can hover over points in the main embedding visualization and view information about the corresponding sentence. (B) The Left Sidebar includes summary statistics and interactive visualizations that can be used to filter the data by sentence type, category, or length. (C) In the Data Explorer view, users can search for specific data instances with a searchable table.
  • Figure 3: When a user clicks on a point, the data augmentation panel opens on the right, where users can choose an augmentation approach. (A) Our first method, Augment with Concepts, suggests relevant concepts, which can be added to or subtracted from the current sentence by adjusting the weight sliders. (B) Second, to Augment by Interpolation, users can select a second sentence to interpolate with to generate new variations. (C) Finally, users can Augment with Large Language Model by entering their own prompt or selecting a prompt idea from the provided list of contextualized suggestions. (D) Below each augmentation method, users can set how many new sentences they would like to generate.
  • Figure 4: Drawing an arrow between sentences to Augment by Interpolation. Orange points represent interpolation suggestions automatically chosen by Amplio.
  • Figure 5: Sample results from Augment with Concepts. (A) After augmentation is complete, the new points are projected onto the embedding visualization in dark blue. (B) All generated "child" sentences for the current "parent" sentence are also visible in a searchable table in the right panel. (C) To view all generated sentences across the whole dataset, users can click the history tab to the left of the augmentation panel. This opens an extended version of the table view in (B).
  • ...and 3 more figures
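
The interpolation idea behind Figure 4 can be illustrated with a minimal sketch, assuming each sentence is represented by an embedding vector and that new candidates are taken at evenly spaced points on the straight line between two embeddings. The function name `interpolate_embeddings` and the choice of linear interpolation are assumptions for illustration. Turning the resulting vectors back into text (the summary above mentions embedding inversion) is omitted here.

```python
from typing import List


def interpolate_embeddings(
    a: List[float], b: List[float], steps: int
) -> List[List[float]]:
    """Return `steps` evenly spaced vectors strictly between `a` and `b`.

    Hypothetical helper: plain linear interpolation, which may differ
    from what the tool does internally.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if steps < 1:
        raise ValueError("steps must be at least 1")

    results = []
    for i in range(1, steps + 1):
        # t runs over (0, 1) exclusive, so the endpoints are never repeated.
        t = i / (steps + 1)
        results.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    return results


# Toy 2-dimensional "embeddings" for two sentences.
start = [0.0, 0.0]
end = [4.0, 8.0]
for vector in interpolate_embeddings(start, end, steps=3):
    print(vector)
```

The `steps` argument plays the role of the "how many new sentences" setting described in Figure 3 (D), under the assumption that one sentence is decoded per interpolated vector.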