Defining and Evaluating Decision and Composite Risk in Language Models Applied to Natural Language Inference

Ke Shen, Mayank Kejriwal

TL;DR

The paper introduces a risk-centric evaluation framework for language models applied to Natural Language Inference, distinguishing two risk types: decision risk and composite risk. It formalizes a two-level inference architecture with an external decision rule (DwD) and a selection rule, enabling risk-adjusted calibration for both discriminative and generative LMs. Central contributions include Risk Injection Functions to create ambiguous risk scenarios, a synthetic-feature-based random-forest DwD module, and novel metrics (P_spe, P_sen, RRR) to quantify composite risk, evaluated on four NLI benchmarks with RoBERTa and GPT-3.5-Turbo. Empirical results show DwD reduces decision risk by up to 25.3% and composite risk by up to 16.6%, with the gains generalizing to black-box models like ChatGPT and holding up in choice-overload settings. The work advances reliable inference by providing a practical, LM-agnostic risk calibration framework with clear metrics and interpretable results for real-world deployment.

Abstract

Despite their impressive performance, large language models (LLMs) such as ChatGPT are known to pose important risks. One such set of risks arises from misplaced confidence, whether over-confidence or under-confidence, that the models have in their inference. While the former is well studied, the latter is not, leading to an asymmetry in understanding the comprehensive risk of the model based on misplaced confidence. In this paper, we address this asymmetry by defining two types of risk (decision and composite risk), and proposing an experimental framework consisting of a two-level inference architecture and appropriate metrics for measuring such risks in both discriminative and generative LLMs. The first level relies on a decision rule that determines whether the underlying language model should abstain from inference. The second level (which applies if the model does not abstain) is the model's inference. Detailed experiments on four natural language commonsense reasoning datasets, using both an open-source ensemble-based RoBERTa model and ChatGPT, demonstrate the practical utility of the evaluation framework. For example, our results show that our framework can get an LLM to confidently respond to an extra 20.1% of low-risk inference tasks that other methods might misclassify as high-risk, and skip 19.8% of high-risk tasks, which would have been answered incorrectly.
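
The two-level architecture described in the abstract can be illustrated with a minimal sketch. It is not the paper's DwD module (which the TL;DR describes as a random forest over synthetic features); a simple confidence threshold stands in as the decision rule, and the names `two_level_inference` and `threshold_rule` are hypothetical, chosen only for this illustration.

```python
from typing import Callable, Optional, Sequence

# A decision rule maps per-choice confidences to True (abstain) or False (answer).
DecisionRule = Callable[[Sequence[float]], bool]


def two_level_inference(
    confidences: Sequence[float],
    should_abstain: DecisionRule,
) -> Optional[int]:
    """Return the index of the selected choice, or None if the model abstains."""
    # Level 1: the decision rule decides whether the model should answer at all.
    if should_abstain(confidences):
        return None
    # Level 2: the model's own inference, assumed here to be the
    # choice with the highest confidence.
    return max(range(len(confidences)), key=lambda i: confidences[i])


def threshold_rule(threshold: float) -> DecisionRule:
    """Hypothetical stand-in rule: abstain when the top confidence is below threshold."""
    return lambda confidences: max(confidences) < threshold


if __name__ == "__main__":
    rule = threshold_rule(0.5)
    print(two_level_inference([0.1, 0.7, 0.2], rule))    # confident instance
    print(two_level_inference([0.4, 0.35, 0.25], rule))  # ambiguous instance
```

Keeping the decision rule as a separate callable mirrors the idea that the first level is external to the language model, so a learned rule could be swapped in without changing the second level.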

Paper Structure (9 sections, 6 equations, 7 figures, 6 tables)

This paper contains 9 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of (a) decision risk and (b) composite risk in LMs on NLI tasks.
  • Figure 2: The risk-centric framework for evaluating LLMs on NLI tasks. Symbols used in the figure are further described in the main text.
  • Figure 3: Risk-coverage curves for RoBERTa ensemble model that uses the proposed DwD method (WQ- and NRA-trained versions) as a decision rule on all four benchmarks.
  • Figure 4: Examples of choice overload in inference scenarios using random and heuristic sampling methods. The correct answers are highlighted.
  • Figure 5: Accuracy of RoBERTa (top) and GPT-3.5-Turbo (bottom) across four benchmarks under various choice paralysis settings. Error bars represent the performance range of the optimal DwD decision rule, which was trained on a training set from one of the benchmarks perturbed by one of the risk injection functions. Legend details include both choice paralysis settings and the corresponding top-performing DwD configurations.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition 1: Decision Risk
  • Definition 2: Composite Risk