Visual Data Diagnosis and Debiasing with Concept Graphs

Rwiddhi Chakraborty, Yinong Wang, Jialu Gao, Runkai Zheng, Cheng Zhang, Fernando De la Torre

TL;DR

ConBias is a novel framework for diagnosing and mitigating concept co-occurrence biases in visual datasets. Data augmentation based on the balanced concept distribution produced by ConBias improves generalization performance across multiple datasets compared to state-of-the-art methods.

Abstract

The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present ConBias, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. ConBias represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by ConBias improves generalization performance across multiple datasets compared to state-of-the-art methods.
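As a rough illustration of what "representing a dataset as a graph of concepts" can look like, the sketch below builds a weighted co-occurrence graph from per-image metadata. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_concept_graph`, the `(class_label, concepts)` input layout, and the toy records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_concept_graph(metadata):
    """Build a weighted co-occurrence graph (hypothetical helper).

    metadata: iterable of (class_label, concepts) pairs, one per image.
    Returns a Counter mapping frozenset({node_a, node_b}) to the number of
    images in which both nodes appear. Class nodes are tagged
    ("class", name) and concept nodes ("concept", name).
    """
    edges = Counter()
    for label, concepts in metadata:
        # One class node plus one node per distinct concept in the image.
        nodes = [("class", label)]
        nodes += [("concept", c) for c in sorted(set(concepts))]
        # Every pair of nodes present in the same image shares an edge.
        for a, b in combinations(nodes, 2):
            edges[frozenset((a, b))] += 1
    return edges


# Toy, made-up metadata purely for illustration.
toy = [
    ("landbird", ["tree", "forest"]),
    ("landbird", ["tree"]),
    ("waterbird", ["ocean", "beach"]),
    ("waterbird", ["tree"]),
]
graph = build_concept_graph(toy)
```

Here the edge weight between a class node and a concept node plays the role of the class-concept co-occurrence frequency; a skewed weight (for example, "tree" appearing mostly with one class) is the kind of imbalance the abstract refers to.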

Paper Structure (35 sections, 8 equations, 17 figures, 7 tables, 1 algorithm)

Figures (17)

  • Figure 1: The conventional data diagnosis and augmentation pipeline begins with an original (biased) dataset. Existing methods address these biases via object frequency calibration (wang2022revise), metadata analysis (dunlap2024diversify), or traditional augmentation techniques (yun2019cutmix, cubuk2020randaugment). In contrast, our framework models visual data as a knowledge graph of concepts, with orange nodes representing classes and blue nodes representing concepts, facilitating a systematic diagnosis of class-concept imbalances for debiasing object co-occurrences in vision datasets.
  • Figure 2: Overview of our framework ConBias. (a) Given a dataset and its concept metadata, which lists the objects present in each image, (b) we build the concept graph using object co-occurrences. The line thickness indicates the co-occurrence frequencies of particular concepts with their respective classes. (c) Next, the clique-based sampling strategy generates under-represented class-concept combinations, which yield (d) the dataset diagnosis result. (e) Finally, with biases discovered, we generate images of classes containing under-represented concept combinations in the dataset with a standard text-to-image generative model.
  • Figure 3: Examples of concept clique sets for the Landbird class in the Waterbirds dataset uncovered by our diagnosis. Concepts such as Tree, Forest, Man, Woman, and Bamboo are overwhelmingly associated with this class, indicating strong co-occurrence bias. All these concepts are causally unrelated to the bird type.
  • Figure 4: Examples of concept imbalances in the Waterbirds dataset. We show the frequencies of concept cliques as discovered in the dataset. We see imbalances across not only single concepts (e.g., Ocean, Grass) but also concept combinations (e.g., (Beach, Ocean), (Tree, Forest)). These are the biases we aim to mitigate for the downstream task.
  • Figure 5: Performance on COCO-GB. We show the accuracies on (a) Class-Balanced (CB) and (b) Out-of-Distribution (OOD) splits. We observe that increasing the number of images in $D_\text{aug}$ improves performance up to a certain point (1000 images).
  • ...and 12 more figures
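
The diagnosis steps described in the captions for Figures 2 to 4 (counting how often concept combinations occur with each class, then flagging under-represented class-concept combinations) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: it counts concept combinations per image rather than enumerating cliques on a graph, it assumes the balancing target is the highest per-class count for each combination, and the names `clique_counts`, `under_represented`, and the toy records are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def clique_counts(metadata, max_size=2):
    """Count, per class, how often each concept combination occurs.

    metadata: iterable of (class_label, concepts) pairs, one per image.
    Returns {concept_combo (sorted tuple): Counter({class_label: count})}.
    """
    counts = defaultdict(Counter)
    for label, concepts in metadata:
        unique = sorted(set(concepts))
        for k in range(1, max_size + 1):
            for combo in combinations(unique, k):
                counts[combo][label] += 1
    return counts


def under_represented(counts, classes):
    """Return {(class_label, combo): deficit} for every class that falls
    short of the best-represented class on a given combination.
    Assumption: the target is the maximum per-class count."""
    deficits = {}
    for combo, per_class in counts.items():
        target = max(per_class.values())
        for cls in classes:
            gap = target - per_class.get(cls, 0)
            if gap > 0:
                deficits[(cls, combo)] = gap
    return deficits


# Toy, made-up metadata purely for illustration.
toy = [
    ("landbird", ["tree", "forest"]),
    ("landbird", ["tree"]),
    ("waterbird", ["ocean", "beach"]),
    ("waterbird", ["tree"]),
]
counts = clique_counts(toy)
deficits = under_represented(counts, ["landbird", "waterbird"])
```

In this sketch each deficit is the number of images that would need to be added for a class and concept combination to match the best-represented class; in the pipeline of Figure 2(e) such combinations would be handed to a text-to-image model as generation targets.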