Comparing Zealous and Restrained AI Recommendations in a Real-World Human-AI Collaboration Task

Chengyuan Xu, Kuo-Chin Lien, Tobias Höllerer

TL;DR

The authors argue that carefully exploiting the precision-recall tradeoff in an AI's recommendations can harness the complementary strengths of human-AI collaboration to significantly improve team performance, with important implications for the design of AI assistance in recall-demanding scenarios.

Abstract

When designing an AI-assisted decision-making system, there is often a tradeoff between precision and recall in the AI's recommendations. We argue that careful exploitation of this tradeoff can harness the complementary strengths in the human-AI collaboration to significantly improve team performance. We investigate a real-world video anonymization task for which recall is paramount and more costly to improve. We analyze the performance of 78 professional annotators working with a) no AI assistance, b) a high-precision "restrained" AI, and c) a high-recall "zealous" AI in over 3,466 person-hours of annotation work. In comparison, the zealous AI helps human teammates achieve significantly shorter task completion time and higher recall. In a follow-up study, we remove AI assistance for everyone and find negative training effects on annotators trained with the restrained AI. These findings and our analysis point to important implications for the design of AI assistance in recall-demanding scenarios.
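To make the precision-recall tradeoff concrete, the following minimal sketch computes precision, recall, and F1 for two AI configurations. The class name `DetectionCounts` and all counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Hypothetical box-level counts for one AI configuration."""

    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        # Fraction of recommended boxes that are real faces.
        predicted = self.true_positives + self.false_positives
        return self.true_positives / predicted if predicted else 0.0

    @property
    def recall(self) -> float:
        # Fraction of real faces that received a recommended box.
        actual = self.true_positives + self.false_negatives
        return self.true_positives / actual if actual else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts over 100 ground-truth faces; NOT taken from the paper.
restrained = DetectionCounts(true_positives=70, false_positives=5, false_negatives=30)
zealous = DetectionCounts(true_positives=95, false_positives=40, false_negatives=5)

for name, ai in (("restrained", restrained), ("zealous", zealous)):
    print(f"{name}: precision={ai.precision:.2f} recall={ai.recall:.2f} f1={ai.f1:.2f}")
```

With these illustrative numbers the two configurations have similar F1 scores, yet the restrained one leaves many more faces for the human to find from scratch, while the zealous one leaves more false boxes to reject.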

Paper Structure

This paper contains 21 sections, 2 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Data processing workflow for Part 1 of the study and the annotation tool user interface. The two AI teammates share the same face detector, which generates bounding-box face detections for each frame independently. The ByteTrack tracker [zhang2022bytetrack] and our proposed false-positive-robust (FPR) tracker define the restrained or zealous AI recommendations: they track the per-frame detections temporally to pre-annotate the videos as shown above. For the human-only workflow, annotators must manually draw a box and adjust its size and location across many frames.
  • Figure 2: When reviewing the AI teammate's recommendations (green bounding boxes), a user takes one of three actions for each box: accept, reject, or solve. In video annotation, because the boxes are temporally tracked across many frames, the time cost of each action differs drastically. Note that the two types of Solve in frame 0 can also come at different costs. ID:1 -- A user can accept the true-positive track ID:1 boxes without any action. ID:2 -- The entire false-positive ID:2 track can be rejected with two mouse clicks by deleting the ID in any of the frames, which is $O(1)$ in time complexity. ID:3 -- False-positive recommendations, like track ID:3, are the most time-consuming to solve: the user can delete and redo this face, or manually adjust every frame until the AI's pre-annotation becomes acceptable, with $\geq n$ mouse clicks, $O(n)$. ID:4 -- In frame 0, to solve the false-negative missing box for the left-most person, a user needs to manually draw a box and adjust its location and size until the AI-suggested box ID:4 comes in, with $\leq n$ mouse clicks, $O(n)$, where $n$ is the number of frames.
  • Figure 3: Screenshot examples of Ego4D videos [Ego4D2022CVPR] used in our face annotation experiment. Easy videos include about one face to annotate in each non-empty frame. Medium videos include about two faces. Hard videos include three or more faces. Videos with more faces are expected to take longer to finish. The study results show completion times increasing from Easy to Medium to Hard videos in both parts (see Figure \ref{fig:part1_task_time} and Figure \ref{fig:part2_task_time}), demonstrating that our video difficulty categorization is reasonable and performed as expected. In selecting the specific videos, we also considered scene diversity, box size (smaller faces are harder), and camera movement intensity (more movement is harder) to ensure a balanced difficulty distribution.
  • Figure 4: Visualizing each group's overall annotation quality on the precision-recall plot with F1 scores (Part 1). Group A manually annotates all videos and, unsurprisingly, is the slowest (Figure \ref{fig:part1_task_time}), with a quality better than the autonomous AI alone but worse than the two human-AI groups' team effort. Annotators in Groups B & C had to accept, reject, or solve the face boxes pre-annotated by the restrained or zealous AIs to improve the human-AI team performance. The arrows show how much humans improved on the AIs' initial annotation.
  • Figure 5: Average annotation time for a single video in Part 1. Lower is better. Error bars represent the 95% confidence interval. Treatment Group A used a baseline manual method, and the annotators in Groups B and C reviewed restrained and zealous AI recommendations in Part 1. Times for Groups B & C include the GPU time used to calculate the AI recommendations.
  • ...and 8 more figures
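
The per-action costs described in the Figure 2 caption can be sketched as a simple click-count model. This is only an illustration under stated assumptions: the names `Action` and `estimated_clicks` are hypothetical, and a solve is simplified to exactly one click per frame, whereas the caption gives bounds of $\geq n$ and $\leq n$ clicks for the two kinds of solve.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"  # keep a correct track as is
    REJECT = "reject"  # delete an entire false-positive track
    SOLVE = "solve"    # fix or create boxes frame by frame


def estimated_clicks(action: Action, n_frames: int) -> int:
    """Rough click estimate for one track spanning n_frames frames.

    Assumption (not from the paper): a solve costs exactly one click
    per frame; the caption only gives bounds around n.
    """
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    if action is Action.ACCEPT:
        return 0         # no action needed
    if action is Action.REJECT:
        return 2         # two clicks regardless of track length, O(1)
    return n_frames      # grows with track length, O(n)


# Hypothetical review session: three tracks of 300 frames each.
session = [(Action.ACCEPT, 300), (Action.REJECT, 300), (Action.SOLVE, 300)]
total = sum(estimated_clicks(action, n) for action, n in session)
print(total)
```

Under this simplification, rejecting a false positive stays cheap no matter how long the track is, while every box that must be solved scales with the number of frames.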