Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

Matthew DosSantos DiSorbo; Harang Ju; Sinan Aral

Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

Matthew DosSantos DiSorbo, Harang Ju, Sinan Aral

TL;DR

The paper examines how large language models handle exceptions to policies and compares their decision-making to humans. It systematically evaluates three tuning approaches—ethical framework prompting, chain-of-thought prompting, and supervised fine-tuning with human explanations—across a suite of real-world scenarios requiring exceptions. The key finding is that supervised fine-tuning with explanations yields the strongest alignment with human judgments and demonstrates transfer to novel contexts, whereas EF prompting and CoT provide limited gains. This work highlights a scalable path toward human-aligned agentic AI and underscores the importance of training models on how decisions are made, not just what decisions are made, for reliable deployment in dynamic environments.

Abstract

Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when testing targeted decision-making: for instance, how models handle exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and chain-of-thought prompting provides only slight improvements, supervised fine-tuning - specifically with human explanations - yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs' shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.

Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

TL;DR

Abstract

Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (17)