VLSlice: Interactive Vision-and-Language Slice Discovery

Eric Slyman, Minsuk Kahng, Stefan Lee

TL;DR

VLSlice is an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted as vision-and-language slices, from unlabeled image sets. A user study shows that VLSlice enables users to quickly generate diverse high-coherency slices, and the tool is released publicly.

Abstract

Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that are efficiently transferable to downstream tasks. While this may improve dataset-scale aggregate metrics, analyzing performance around hand-crafted subgroups targeting specific bias dimensions reveals systemic undesirable behaviors. However, this subgroup analysis is frequently stalled by annotation efforts, which require extensive time and resources to collect the necessary data. Prior art attempts to automatically discover subgroups to circumvent these constraints, but it typically leverages model behavior on existing task-specific annotations and rapidly degrades on more complex inputs beyond "tabular" data, and none of it studies vision-and-language models. This paper presents VLSlice, an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted as vision-and-language slices, from unlabeled image sets. We show that VLSlice enables users to quickly generate diverse high-coherency slices in a user study (n=22) and release the tool publicly.

Paper Structure

This paper contains 14 sections, 4 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: An example user workflow with VLSlice. Users begin by writing (A) a Query to the model, then (B) Explore the resulting visiolinguistic clusters to find interesting candidates from which to begin building a slice. Once users identify a hypothesis, they can (C) Refine the clusters by gathering additional samples in a human-in-the-loop manner, with VLSlice recommending similar and counterfactual examples to add to the clusters. Finally, users can (D) Validate the bias behavior of the model on this slice.
  • Figure 2: Sample rankings by baseline caption ($C_b$ = "person"), augmented caption ($C_a$ = "happy person"), and $\Delta C$, with highest on the left. The change in percentile from $C_b$ to $C_a$ is shown with green arrows for positive changes, red arrows for negative, and gray arrows for neutral. We enlarge the photo of people with smiling faces eating a meal. The rank of this photo does not change from $C_b$ to $C_a$ (4th), but increases (2nd) under $\Delta C$. Captions are prepended with "A photo of a " in practice.
  • Figure 3: Similar and counterfactual clusters for a slice capturing an unintended subset of "glasses" for the query $C_b =$ "A photo of a person", $C_a =$ "A photo of a CEO". While the similar clusters display additional masculine-presenting glasses-wearers with positive $\Delta C$, counterfactuals help escape this region by displaying a cluster of feminine-presenting glasses-wearers with opposing negative $\Delta C$.
  • Figure 4: Example slices created by participants for the Person/CEO task with VLSlice. In the "masculine glasses" slice (top), the participant identified that people wearing glasses with larger features or facial hair have a positive $\Delta C$, indicating a CEO-like bias. In contrast, the "people of color" (bottom) slice has a negative $\Delta C$, indicating bias against people with darker skin tones being CEO-like.
  • Figure 5: Additional $\Delta C$ ranking examples where $C_b$ and $C_a$ are the baseline and augmented captions, respectively.
  • ...and 6 more figures
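
The $\Delta C$ ranking described in the Figure 2 caption can be illustrated with a small sketch. The captions above do not state the formula, so this assumes that $\Delta C$ is an image's similarity to the augmented caption $C_a$ minus its similarity to the baseline caption $C_b$. The embeddings are toy 2-D vectors rather than real model outputs, and the names `cosine` and `delta_c_ranking` are hypothetical helpers, not part of VLSlice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def delta_c_ranking(image_embs, base_emb, aug_emb):
    """Rank images by the ASSUMED score sim(image, C_a) - sim(image, C_b).

    Returns (names sorted from highest to lowest score, score per name).
    """
    scores = {
        name: cosine(emb, aug_emb) - cosine(emb, base_emb)
        for name, emb in image_embs.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy stand-ins for caption embeddings (hypothetical values).
base_emb = [1.0, 0.0]  # plays the role of C_b = "person"
aug_emb = [0.8, 0.6]   # plays the role of C_a = "happy person"

# Toy stand-ins for image embeddings (hypothetical values).
image_embs = {
    "smiling_meal": [0.6, 0.8],
    "laughing_group": [0.8, 0.6],
    "neutral_portrait": [1.0, 0.0],
}

ranking, scores = delta_c_ranking(image_embs, base_emb, aug_emb)
aug_ranking = sorted(
    image_embs, key=lambda n: cosine(image_embs[n], aug_emb), reverse=True
)
print("ranked by C_a:    ", aug_ranking)
print("ranked by delta C:", ranking)
```

In this toy data, `smiling_meal` ranks second by similarity to the augmented caption alone but first under the difference score, because the difference discounts images that already match the baseline caption well. This mirrors the kind of rank change the Figure 2 caption describes, under the stated assumption about how $\Delta C$ is computed.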