Multi-Round Human-AI Collaboration with User-Specified Requirements

Sima Noorani; Shayan Kiyani; Hamed Hassani; George Pappas

TL;DR

This work adopts a human-centric view governed by two principles: counterfactual harm, which ensures the AI does not undermine human strengths, and complementarity, which ensures it adds value where the human is prone to err. It introduces an online, distribution-free algorithm that enforces user-specified constraints over the collaboration dynamics.

Abstract

As humans increasingly rely on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework in two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains the prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.
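To make the idea of online, distribution-free error control concrete, here is a minimal sketch of a generic threshold-tracking update in the style of adaptive conformal inference. It is not the paper's algorithm, which this summary does not specify. The function name `run_online_threshold`, the step size `eta`, the starting threshold `tau0`, and the synthetic score stream are all assumptions made for illustration. An "error" here means that a score exceeds the current threshold, standing in for a violated requirement.

```python
import random

def run_online_threshold(scores, eps, eta=0.05, tau0=0.5):
    """Track a threshold so the long-run error rate approaches eps.

    Illustrative stand-in only: a generic online update, not the
    procedure from the paper.
    """
    tau = tau0
    errors = []
    for s in scores:
        err = 1 if s > tau else 0   # error: the score falls outside the threshold
        errors.append(err)
        tau += eta * (err - eps)    # raise tau after an error, lower it slightly otherwise
    return errors, tau

random.seed(0)
T = 5000
# Hypothetical nonstationary stream in [0, 1): the distribution shifts halfway through.
scores = [random.random() ** (1.0 if t < T // 2 else 3.0) for t in range(T)]

errors, tau_final = run_online_threshold(scores, eps=0.1)
avg_error = sum(errors) / T
print(f"average error: {avg_error:.3f}, final threshold: {tau_final:.3f}")
```

Because the scores are bounded, the threshold stays in a bounded range, and summing the update shows that the average error differs from `eps` by at most the total threshold movement divided by `eta * T`. That holds with no assumption on how the scores are distributed, which is why the shift halfway through the stream does not break the guarantee in this sketch.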

Paper Structure (48 sections, 2 theorems, 27 equations, 12 figures, 1 table)

Introduction
Related Works
Background: Single-Round Collaborative Prediction Sets
Counterfactual Harm.
Complementarity.
Problem Formulation: Multi-Round Collaboration
Interaction Protocol
User-Defined Collaboration Requirements
Rule-Based Formulation.
Algorithm and Guarantees
Experiments
Instantiation of Rules and Score Functions.
LLM-Simulated Experiments
Experimental Design: Medical Diagnosis.
Empirical Convergence.
...and 33 more sections

Key Result

Theorem 3.1

The solution to the single-round optimization problem that minimizes $\mathbb{E}[|C(X)|]$ is of the form

Figures (12)

Figure 1: Empirical convergence of error rates in the LLM-simulated medical task for complementarity (left) and counterfactual harm (right). Both plots track the cumulative average running error, $\text{AvgError}_t = \frac{1}{t} \sum_{i=1}^t \mathbf{1}\{\text{Error}_i\}$, over sequential trials.
Figure 2: Human decision outcomes across varying AI error targets in the LLM-simulated medical task. (Left) Human GT gain rate as a function of the complementarity error target ($\varepsilon_{\text{COMP}}$). (Right) Rate of abandoning a correct initial guess (GT loss) as a function of the counterfactual harm error target ($\varepsilon_{\text{CH}}$).
Figure 3: Error convergence for Algorithm B in the crowdsourcing study. Both plots track the cumulative average running error, $\text{AvgError}_t = \frac{1}{t} \sum_{i=1}^t \mathbf{1}\{\text{Error}_i\}$, where observed rates for counterfactual harm (left) and complementarity (right) stabilize at their nominal values.
Figure 4: Behavioral impact of AI errors in the crowdsourcing task across Algorithms A, B, and C. Top row: human GT loss rates compared across trials with and without CH errors. Bottom row: human GT gain rates compared across trials with and without COMP errors.
Figure 5: Impact of nominal error rates on human outcomes: (a) Comparison of human's GT loss rates between Algorithm A ($\varepsilon_{\text{CH}}=0.05$) and Algorithm B ($\varepsilon_{\text{CH}}=0.30$). (b) Comparison of human's GT gain rates between Algorithm A ($\varepsilon_{\text{COMP}}=0.50$) and Algorithm C ($\varepsilon_{\text{COMP}}=0.70$).
...and 7 more figures
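The convergence plots in Figures 1 and 3 track $\text{AvgError}_t = \frac{1}{t} \sum_{i=1}^t \mathbf{1}\{\text{Error}_i\}$. The following short sketch shows how that curve can be computed from a sequence of per-trial error indicators. The function name `running_avg_error` and the example flags are hypothetical and are not taken from the paper.

```python
from itertools import accumulate

def running_avg_error(error_flags):
    """Return AvgError_t = (1/t) * sum_{i<=t} 1{Error_i} for t = 1..T."""
    # Running count of errors after each trial.
    cumulative = accumulate(int(bool(e)) for e in error_flags)
    # Divide each running count by the number of trials seen so far.
    return [total / t for t, total in enumerate(cumulative, start=1)]

# Made-up indicators: errors occur on trials 1 and 4.
flags = [1, 0, 0, 1]
curve = running_avg_error(flags)
print(curve)
```

Plotting `curve` against the trial index gives the kind of running-average trace described in the captions, and comparing its tail with the nominal target shows whether the observed rate has stabilized.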

Theorems & Definitions (4)

Theorem 3.1
Remark 5.1
Theorem 5.2: Finite-sample online control of collaboration errors
Proof of Theorem 5.2
