Extracting Human Attention through Crowdsourced Patch Labeling

Minsuk Chang, Seokhyeon Park, Hyeon Jeon, Aeri Cho, Soohyun Lee, Jinwook Seo

TL;DR

A novel patch-labeling method that integrates AI assistance with crowdsourcing to capture human attention from images, offering a viable way to mitigate dataset bias. Experiments demonstrate improved classification accuracy and a more refined model focus.

Abstract

In image classification, a significant problem arises from bias in the datasets. When a dataset contains only specific types of images, the classifier begins to rely on shortcuts: simplistic and erroneous rules for decision-making. This leads to high performance on the training dataset but inferior results on new, varied images, as the classifier's generalization capability is reduced. For example, if the images labeled "mustache" consist solely of male figures, the model may inadvertently learn to classify images by gender rather than by the presence of a mustache. One approach to mitigating such biases is to direct the model's attention toward the target object's location, usually annotated with bounding boxes or polygons. However, collecting such annotations requires substantial time and human effort. Therefore, we propose a novel patch-labeling method that integrates AI assistance with crowdsourcing to capture human attention from images, which can be a viable solution for mitigating bias. Our method consists of two steps. First, we extract the approximate location of a target using a pre-trained saliency detection model, supplemented by human verification for accuracy. Then, we determine the human-attentive area in the image by iteratively dividing the image into smaller patches and employing crowdsourcing to ascertain whether each patch can be classified as the target object. We demonstrate the effectiveness of our method in mitigating bias through improved classification accuracy and the refined focus of the model. In addition, crowdsourced experiments show that our method collects human annotations up to 3.4 times faster than annotating object locations with polygons, significantly reducing the need for human resources. We conclude the paper by discussing the advantages of our method in a crowdsourcing context, focusing mainly on human error and accessibility.
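
The second step of the method, iterative patch division with crowdsourced yes/no answers, can be pictured as a coarse-to-fine search. The Python sketch below is illustrative only and is not the authors' implementation: the 2x2 split, the majority-vote aggregation, the `min_size` stopping rule, and every function name are assumptions introduced for this example. The `ask_workers` callback stands in for a real crowdsourcing query.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def majority_vote(votes: List[bool]) -> bool:
    """Aggregate worker answers; ties and empty vote lists count as 'no'."""
    return sum(votes) * 2 > len(votes)


def split_into_quadrants(box: Box) -> List[Box]:
    """Divide a box into a 2x2 grid of sub-patches (assumed split scheme)."""
    x, y, w, h = box
    half_w, half_h = w // 2, h // 2
    return [
        (x, y, half_w, half_h),
        (x + half_w, y, w - half_w, half_h),
        (x, y + half_h, half_w, h - half_h),
        (x + half_w, y + half_h, w - half_w, h - half_h),
    ]


def refine_patches(
    start: Box,
    ask_workers: Callable[[Box], List[bool]],
    min_size: int,
) -> List[Box]:
    """Return the finest patches that workers judged to contain the target.

    `start` is the approximate target region (e.g. from a saliency model) and
    is assumed to be at least 2 * min_size pixels on each side.
    """
    accepted: List[Box] = []
    frontier = [start]
    while frontier:
        next_frontier: List[Box] = []
        for box in frontier:
            for patch in split_into_quadrants(box):
                if not majority_vote(ask_workers(patch)):
                    continue  # rejected patches are never subdivided
                _, _, w, h = patch
                if min(w, h) // 2 < min_size:
                    accepted.append(patch)  # too small to split again
                else:
                    next_frontier.append(patch)
        frontier = next_frontier
    return accepted


if __name__ == "__main__":
    # Simulated crowd: three workers who all answer "yes" exactly when the
    # patch overlaps a hypothetical 16x16 target in the top-left corner.
    target = (0, 0, 16, 16)

    def simulated_workers(patch: Box) -> List[bool]:
        px, py, pw, ph = patch
        tx, ty, tw, th = target
        hit = px < tx + tw and tx < px + pw and py < ty + th and ty < py + ph
        return [hit] * 3

    print(refine_patches((0, 0, 64, 64), simulated_workers, min_size=16))
```

Because rejected patches are discarded at each level, workers are only asked about regions that already received a positive answer, which is one plausible reason a yes/no patch task could be cheaper than drawing polygons.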

Paper Structure (46 sections, 4 equations, 10 figures, 3 tables, 2 algorithms)

Figures (10)

  • Figure 1: Identified problems of Segment Anything and SEEM on images from the CelebA and AwA2 datasets. Both models successfully segmented animals against typical backgrounds. However, they tended to merge the facial elements in the CelebA dataset, producing overly coarse segments. They also produced scattered patches for marine animals in the AwA2 dataset, or failed to detect the animal at all.
  • Figure 2: Saliency maps generated at each image size, which differ in quality and in how dispersed the salient pieces are. We merged the maps from two image sizes, 256 and 512, to generate a saliency map that covers a broad range while retaining the significant sections. Other sizes (128 or 1024) were not used because they produce blank or overly scattered maps.
  • Figure 3: Prolific survey interface for our experiment, showing the questions presented to workers. Both the (A) patch labeling interface and the (B) polygon drawing interface are shown. Instructions are placed in a side panel as a constant reminder. Object labels (Cat) appear in the instructions in red. In the patch labeling interface, the image is annotated with blue boxes where the user clicks, while in the polygon drawing interface users can freely draw a polygon by clicking points.
  • Figure 4: The target object is removed from the original image, leaving only the bias factors. An automatic object removal service (silverai) was used to generate these images.
  • Figure 5: Examples of the extracted Human Attention masks on each dataset. The mask is overlaid on the original image as a heatmap; red indicates greater human attention.
  • ...and 5 more figures