Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
Catherine Yeh, Donghao Ren, Yannick Assogba, Dominik Moritz, Fred Hohman
TL;DR
Amplio tackles the challenge of augmenting unstructured text datasets by identifying under-explored regions in embedding space and filling them with three human-in-the-loop methods: Augment with Concepts, Augment by Interpolation, and Augment with LLM. The approach combines embedding inversion, concepts derived from sparse autoencoders (SAEs), and guided prompting to provide controllable, interpretable augmentation while maintaining data quality. A formative study at Apple informed the design goals, and a user study with 18 red teamers showed that Amplio helps generate diverse, relevant safety prompts and revealed distinct use cases for each method. The work suggests practical benefits for improving dataset diversity, offers insights into human-in-the-loop design and visualization-assisted augmentation, and outlines pathways for integrating such tools into real-world workflows for safer, more robust models.
Abstract
Data augmentation is crucial for making machine learning models more robust and safe. However, augmenting data can be challenging, as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.
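As a rough illustration of what "identifying empty data spaces" could look like computationally, the sketch below bins 2D-projected embeddings into a grid and reports cells that contain no points. This is a simplified assumption for illustration, not Amplio's actual method; the function names `find_empty_cells` and `cell_center`, the grid-based approach, and the `bins` parameter are all hypothetical.

```python
import numpy as np


def find_empty_cells(points_2d, bins=4):
    """Return grid cells with no points, plus the grid edges.

    Hypothetical helper: assumes embeddings were already projected to 2D
    (for example with UMAP or PCA). The grid spans the data's bounding box.
    """
    points_2d = np.asarray(points_2d, dtype=float)
    # counts[i, j] is the number of points in x-bin i and y-bin j.
    counts, x_edges, y_edges = np.histogram2d(
        points_2d[:, 0], points_2d[:, 1], bins=bins
    )
    empty = [tuple(int(k) for k in cell) for cell in np.argwhere(counts == 0)]
    return empty, x_edges, y_edges


def cell_center(cell, x_edges, y_edges):
    """Return the (x, y) center of a grid cell, a candidate point to explore."""
    i, j = cell
    x = (x_edges[i] + x_edges[i + 1]) / 2
    y = (y_edges[j] + y_edges[j + 1]) / 2
    return float(x), float(y)
```

In a workflow like the one described above, the center of an empty cell could serve as a starting location for a human-guided augmentation step, with the practitioner deciding whether the gap is meaningful or simply an unrealistic region of the space.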
