Table of Contents
Fetching ...

Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

Matthew DosSantos DiSorbo, Harang Ju, Sinan Aral

TL;DR

The paper examines how large language models handle exceptions to policies and compares their decision-making to humans. It systematically evaluates three tuning approaches—ethical framework prompting, chain-of-thought prompting, and supervised fine-tuning with human explanations—across a suite of real-world scenarios requiring exceptions. The key finding is that supervised fine-tuning with explanations yields the strongest alignment with human judgments and demonstrates transfer to novel contexts, whereas EF prompting and CoT provide limited gains. This work highlights a scalable path toward human-aligned agentic AI and underscores the importance of training models on how decisions are made, not just what decisions are made, for reliable deployment in dynamic environments.

Abstract

Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when testing targeted decision-making: for instance, how models handle exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and chain-of-thought prompting provides only slight improvements, supervised fine-tuning - specifically with human explanations - yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs' shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.

Teaching AI to Handle Exceptions: Supervised Fine-Tuning with Human-Aligned Judgment

TL;DR

The paper examines how large language models handle exceptions to policies and compares their decision-making to humans. It systematically evaluates three tuning approaches—ethical framework prompting, chain-of-thought prompting, and supervised fine-tuning with human explanations—across a suite of real-world scenarios requiring exceptions. The key finding is that supervised fine-tuning with explanations yields the strongest alignment with human judgments and demonstrates transfer to novel contexts, whereas EF prompting and CoT provide limited gains. This work highlights a scalable path toward human-aligned agentic AI and underscores the importance of training models on how decisions are made, not just what decisions are made, for reliable deployment in dynamic environments.

Abstract

Large language models (LLMs), initially developed for generative AI, are now evolving into agentic AI systems, which make decisions in complex, real-world contexts. Unfortunately, while their generative capabilities are well-documented, their decision-making processes remain poorly understood. This is particularly evident when testing targeted decision-making: for instance, how models handle exceptions, a critical and challenging aspect of decision-making made relevant by the inherent incompleteness of contracts. Here we demonstrate that LLMs, even ones that excel at reasoning, deviate significantly from human judgments because they adhere strictly to policies, even when such adherence is impractical, suboptimal, or even counterproductive. We then evaluate three approaches to tuning AI agents to handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. We find that while ethical framework prompting fails and chain-of-thought prompting provides only slight improvements, supervised fine-tuning - specifically with human explanations - yields markedly better results. Surprisingly, in our experiments, supervised fine-tuning even enabled models to generalize human-like decision-making to novel scenarios, demonstrating transfer learning of human-aligned decision-making across contexts. Furthermore, fine-tuning with explanations, not just labels, was critical for alignment, suggesting that aligning LLMs with human judgment requires explicit training on how decisions are made, not just which decisions are made. These findings highlight the need to address LLMs' shortcomings in handling exceptions in order to guide the development of agentic AI toward models that can effectively align with human judgment and simultaneously adapt to novel contexts.

Paper Structure

This paper contains 33 sections, 1 equation, 17 figures, 1 table.

Figures (17)

  • Figure 1: Baseline Refusal Rates for LLM and Human Decision-Makers Across Scenarios Refusal rates across multiple exception-handling scenarios, comparing responses from Claude, Gemini, Llama and OpenAI models to $303$ human participants. For each scenario, LLMs and humans were introduced to decision-making scenarios with policy constraints. They were then asked whether a policy exception should be granted — the level to which the exception violates the policy varies (i.e., exceeding a price limit by $15, exceeding a price limit by $10, etc.); each human responded to one LEVEL for each scenario. In general, LLMs overwhelmingly refused to grant exceptions, while humans exhibited greater flexibility, especially for low-severity violations (e.g., exceeding a price limit by $0.01). LLM results are aggregated across models; for example, the Claude results are a weighted average of Claude Opus 4, Sonnet 4 and Haiku 3.5. $\pm$ 1 standard error bars are included (variance is pooled across models).
  • Figure 2: Exception Handling Across Ethical Frameworks Comparison of LLM refusal rates when prompted to reason using virtue ethics. While responses are generally more flexible compared to LLM reasoning without an ethical framework, LLM refusal rates are still broadly different from human refusal rates, across scenarios, levels of exception, and frameworks — similar results hold for consequentialist and deontological frameworks, which are not depicted here. The results suggest that guiding an LLM to reason under an ethical framework will not result in human-aligned judgment. $\pm 1$ standard error bars are included.
  • Figure 3: Effects of Supervised Fine-Tuning with Binary Labels on Exception Handling Comparison of GPT-4o and Gemini 2.5 Flash refusal rates after supervised fine-tuning (SFT) with binary (yes-or-no) human responses. Baseline models (not fine-tuned), as well as GPT-4o fine-tuned with binary human responses, overwhelmingly refuse exceptions. However, Gemini 2.5 Flash fine-tuned with binary human responses displayed increased flexibility and alignment with human judgment. The results suggest that training with binary labels can be — but is not always — effective for enabling nuanced decision-making in agentic AI systems. Both GPT-4o and Gemini 2.5 Flash were fine-tuned using $n = 303$ binary yes-or-no human responses for each scenario. $\pm 1$ standard error bars are included.
  • Figure 4: Effects of Supervised Fine-Tuning with Full Human Responses on Exception Handling Comparison of GPT-4o and Gemini 2.5 Flash refusal rates after supervised fine-tuning (SFT) with full human responses. Baseline models (not fine-tuned) overwhelmingly refused exceptions. However, both GPT-4o and Gemini 2.5 Flash fine-tuned with full human responses displayed increased flexibility and alignment with human judgment. The results suggest that training with full human responses may be an effective method for enabling nuanced decision-making in agentic AI systems — potentially more effective than training with binary labels, as was the case with GPT-4o. GPT-4o was fined-tuned with $n = 50$, and Gemini 2.5 Flash with $n = 303$, full human responses for each scenario. $\pm 1$ standard error bars are included.
  • Figure 5: Effects of Supervised Fine-Tuning on Transfer Learning GPT-4o and Gemini 2.5 Flash refusal rates on novel scenarios after supervised fine-tuning (SFT) with full human explanations. Interestingly, models fine-tuned with full human explanations exhibit improved alignment with human judgment — even when prompted with decision-making scenarios distinct from the scenarios they were trained on. The results suggest that the SFT engenders the potential for transfer learning: LLMs apply learned reasoning patterns to novel contexts, which results in more nuanced decision-making. $\pm 1$ standard error bars are included.
  • ...and 12 more figures