Exploring the Lands Between: A Method for Finding Differences between AI-Decisions and Human Ratings through Generated Samples

Lukas Mecke, Daniel Buschek, Uwe Gruenefeld, Florian Alt

TL;DR

A method to find samples in the latent space of a generative model, designed to be challenging for a decision-making model with regard to matching human expectations, is proposed and applied to a face recognition model.

Abstract

Many important decisions in our everyday lives, such as authentication via biometric models, are made by Artificial Intelligence (AI) systems. These can be in poor alignment with human expectations, and testing them on clear-cut existing data may not be enough to uncover those cases. We propose a method to find samples in the latent space of a generative model, designed to be challenging for a decision-making model with regard to matching human expectations. By presenting those samples to both the decision-making model and human raters, we can identify areas where its decisions align with human intuition and where they contradict it. We apply this method to a face recognition model and collect a dataset of 11,200 human ratings from 100 participants. We discuss findings from our dataset and how our approach can be used to explore the performance of AI models in different contexts and for different user groups.

Paper Structure (30 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Illustration of how our proposed samples are generated (simplified illustration of a latent space). Positive samples are generated by sampling points close to the genuine sample. Negatives are random other points in the latent space. Optimized samples are generated starting at a negative sample and using an optimization function to find samples that are more similar to the genuine sample. Interpolation samples are found as steps on the latent path between the genuine sample and negative samples.
  • Figure 2: Screenshot of the main task in our online study. Participants were presented with two images (a base image on the left and a sample generated with our approach on the right) and were asked to rate their similarity and if the images showed the same person.
  • Figure 3: Distribution of perceived similarity based on the type of sample (left) and in comparison to the ratings of a face recognition model (right). Marker size indicates latent distance.
  • Figure 4: Violin plots showing the distribution of face recognition scores for different sample types, split by the human identity rating (blue: rated as different by humans, orange: rated as the same person). The width of each violin is proportional to the number of observations. The left plot shows ratings and scores for all sample types, and the right plot gives more detail on the different levels of interpolation. The blue line indicates the default distance for the face recognition model to accept a face.
  • Figure 5: Samples from our dataset with the biggest disagreement between model and participants and between participants themselves. The top row in each figure contains base images and the bottom row contains generated alterations.
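
The four sample types described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the latent space is a plain NumPy vector with a standard normal prior, positives are Gaussian perturbations with a hypothetical `radius`, interpolation is linear, and the "optimization function" is replaced by simple random-search hill climbing on a stand-in `latent_distance` score. In the paper's setting, that score would presumably involve the face recognition model applied to generated images.

```python
import numpy as np

def latent_distance(a, b):
    # Stand-in similarity score (assumption): Euclidean distance in latent space.
    return float(np.linalg.norm(a - b))

def positive_samples(genuine, n, radius, rng):
    # Points close to the genuine sample; `radius` is a hypothetical noise scale.
    return genuine + radius * rng.standard_normal((n, genuine.shape[0]))

def negative_samples(n, dim, rng):
    # Random other points in the latent space (standard normal prior assumed).
    return rng.standard_normal((n, dim))

def interpolation_samples(genuine, negative, steps):
    # Evenly spaced steps on the straight path between the two points,
    # excluding both endpoints (linear interpolation is an assumption).
    ts = np.linspace(0.0, 1.0, steps + 2)[1:-1]
    return np.array([(1.0 - t) * genuine + t * negative for t in ts])

def optimized_sample(genuine, negative, score_fn, iters, step, rng):
    # Start at a negative sample and keep only random perturbations that
    # lower the score, i.e. that make the sample more similar to the genuine one.
    current = negative.copy()
    best = score_fn(current, genuine)
    for _ in range(iters):
        candidate = current + step * rng.standard_normal(current.shape)
        score = score_fn(candidate, genuine)
        if score < best:
            current, best = candidate, score
    return current

rng = np.random.default_rng(0)
dim = 8
genuine = rng.standard_normal(dim)
negative = negative_samples(1, dim, rng)[0]

positives = positive_samples(genuine, n=5, radius=0.1, rng=rng)
interpolated = interpolation_samples(genuine, negative, steps=3)
optimized = optimized_sample(genuine, negative, latent_distance,
                             iters=200, step=0.1, rng=rng)
```

In a real pipeline, each latent vector would be passed through the generative model to obtain an image before being shown to the decision-making model and to human raters. That step is omitted here because the source does not specify the generator's interface.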