Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models

Cheng-Kuang Wu, Zhi Rui Tam, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

TL;DR

The paper investigates how language models make risk-aware decisions to answer or defer in the face of uncertain consequences. It introduces an evaluation framework that varies human-defined risk structures $(r_{\mathrm{cor}}, r_{\mathrm{inc}}, r_{\mathrm{ref}})$ while keeping tasks fixed, measuring how well LM policies maximize expected reward. Across multiple datasets, models exhibit suboptimal behaviors, often over-answering in high-risk and over-deferring in low-risk scenarios, traced to difficulty in composing independent skills for decision making. A skill-decomposition approach implemented via prompt chaining, which isolates downstream task solving, confidence estimation, and expected-value reasoning, consistently improves risk-aware decision policies, providing actionable guidance for deploying more reliable LM-based agents across diverse risk levels.
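The three isolated skills can be pictured as a small pipeline. The following is a minimal sketch, not the authors' implementation: `Risk`, `answer_or_defer`, `solve`, and `estimate_confidence` are hypothetical names, the two callables stand in for separate LM prompts, and the third step is written as plain arithmetic in place of the model's own expected-value reasoning.

```python
from typing import Callable, NamedTuple, Optional


class Risk(NamedTuple):
    """Hypothetical container for a risk structure (r_cor, r_inc, r_ref)."""
    r_cor: float  # reward for a correct answer
    r_inc: float  # reward (usually a penalty) for an incorrect answer
    r_ref: float  # reward for refusing / deferring


def answer_or_defer(
    question: str,
    risk: Risk,
    solve: Callable[[str], str],
    estimate_confidence: Callable[[str, str], float],
) -> Optional[str]:
    """Chain the three skills; return the answer, or None to defer."""
    # Skill 1: solve the downstream task with no risk information in view.
    answer = solve(question)
    # Skill 2: estimate the probability that this answer is correct.
    p = estimate_confidence(question, answer)
    # Skill 3: compare the expected value of answering with that of deferring.
    ev_answer = p * risk.r_cor + (1.0 - p) * risk.r_inc
    return answer if ev_answer > risk.r_ref else None


if __name__ == "__main__":
    # Stub "models" for illustration only: always answer "B" with confidence 0.7.
    stub_solve = lambda q: "B"
    stub_conf = lambda q, a: 0.7
    print(answer_or_defer("Which option?", Risk(1.0, -1.0, 0.0), stub_solve, stub_conf))
    print(answer_or_defer("Which option?", Risk(1.0, -8.0, 0.0), stub_solve, stub_conf))
```

With the same stub confidence, the sketch answers under the mild penalty and defers under the severe one, which is the risk-conditioned behavior the framework evaluates. The risk values here are illustrative, not taken from the paper.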

Abstract

Language models (LMs) are increasingly used to build agents that can act autonomously to achieve goals. During this automatic process, agents need to take a series of actions, some of which might lead to severe consequences if incorrect actions are taken. Therefore, such agents must sometimes defer, refusing to act when their confidence is insufficient, to avoid the potential cost of incorrect actions. Because the severity of consequences varies across applications, the tendency to defer should also vary: in low-risk settings agents should answer more freely, while in high-risk settings their decisions should be more conservative. We study this "answer-or-defer" problem with an evaluation framework that systematically varies human-specified risk structures (rewards and penalties for correct answers, incorrect answers, and refusals, $(r_{\mathrm{cor}}, r_{\mathrm{inc}}, r_{\mathrm{ref}})$) while keeping tasks fixed. This design evaluates LMs' risk-aware decision policies by measuring their ability to maximize expected reward. Across multiple datasets and models, we identify flaws in their decision policies: LMs tend to over-answer in high-risk settings and over-defer in low-risk settings. After analyzing the potential cause of such flaws, we find that a simple skill-decomposition method, which isolates the independent skills required for answer-or-defer decision making, can consistently improve LMs' decision policies. Our results highlight the current limitations of LMs in risk-conditioned decision making and provide practical guidance for deploying more reliable LM-based agents across applications of varying risk levels.
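The expected-reward objective implies a break-even confidence for answering. As a sketch, assume $p$ (a symbol introduced here, not taken from the paper) is the probability that the model's answer is correct, and that $r_{\mathrm{cor}} > r_{\mathrm{inc}}$; the paper's own equations may be stated differently.

$$
\begin{aligned}
\mathbb{E}[\text{answer}] &= p\, r_{\mathrm{cor}} + (1 - p)\, r_{\mathrm{inc}}, \qquad \mathbb{E}[\text{defer}] = r_{\mathrm{ref}}, \\
\text{answer} &\iff p\, r_{\mathrm{cor}} + (1 - p)\, r_{\mathrm{inc}} > r_{\mathrm{ref}} \iff p > \frac{r_{\mathrm{ref}} - r_{\mathrm{inc}}}{r_{\mathrm{cor}} - r_{\mathrm{inc}}}.
\end{aligned}
$$

For the illustrative structure $(1, -1, 0)$ the threshold is $0.5$, while for $(1, -8, 0)$ it rises to $8/9 \approx 0.89$. Under this reading, over-answering means answering below the threshold and over-deferring means refusing above it.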

Paper Structure

This paper contains 23 sections, 3 equations, 18 figures, 11 tables.

Figures (18)

  • Figure 1: Illustration of the answer-or-refuse problem under different risk structures. Depending on the application, the relative reward for correct answers and penalty for incorrect answers vary dramatically. For instance, brainstorming ideas is low-risk: one novel idea may be highly rewarding while bad ideas incur only minor costs. In contrast, deciding whether to conduct surgery is high-risk: a wrong decision brings severe consequences. Our central question is whether LMs can adapt their decision policies to maximize expected reward across such diverse scenarios. The shown reward–penalty ratios are illustrative examples rather than fixed values.
  • Figure 2: The risk-informing prompt for our experiments.
  • Figure 3: Refusal proportions. Ideal refusal ratios represent the optimal decision-making policy.
  • Figure 4: Sampled LM outputs with (left) versus without (right) expected-value reasoning.
  • Figure 5: The pure gambling prompt to evaluate the skill of expected-value reasoning.
  • ...and 13 more figures