Systematic Characterization of the Effectiveness of Alignment in Large Language Models for Categorical Decisions

Isaac Kohane

TL;DR

This paper applies a systematic methodology for evaluating preference alignment in LLMs on categorical decision-making, with medical triage as a domain-specific use case, and introduces a novel, simple measure, the Alignment Compliance Index (ACI), which quantifies how effectively an LLM can be aligned to a given preference function or gold standard.

Abstract

As large language models (LLMs) are deployed in high-stakes domains like healthcare, understanding how well their decision-making aligns with human preferences and values becomes crucial, especially when we recognize that there is no single gold standard for these preferences. This paper applies a systematic methodology for evaluating preference alignment in LLMs on categorical decision-making, with medical triage as a domain-specific use case. It also measures how effectively an alignment procedure will change the alignment of a specific model. Key to this methodology is a novel, simple measure, the Alignment Compliance Index (ACI), that quantifies how effectively an LLM can be aligned to a given preference function or gold standard. Since the ACI measures the effect rather than the process of alignment, it is applicable to alignment methods beyond the in-context learning used in this study. Using a dataset of simulated patient pairs, three frontier LLMs (GPT4o, Claude 3.5 Sonnet, and Gemini Advanced) were assessed on their ability to make triage decisions consistent with an expert clinician's preferences. The models' performance before and after alignment attempts was evaluated using various prompting strategies. The results reveal significant variability in alignment effectiveness across models and alignment approaches. Notably, models that performed well pre-alignment, as measured by ACI, sometimes degraded post-alignment, and small changes in the target preference function led to large shifts in model rankings. The implicit ethical principles, as understood by humans, underlying the LLMs' decisions were also explored through targeted questioning. This study motivates the near-term use of a practical set of methods and the ACI to understand the correspondence between the variety of human and LLM decision-making values in categorical decision-making such as triage.

Paper Structure (26 sections, 7 equations, 6 figures, 7 tables)

Introduction
Methods
Q1 Methods
Q2 Methods
Q3 Methods
Q4 Methods: Explicit Debrief on Decision-making
Q5 Methods: Changes in Alignment
Q6 Methods: Quantifying Alignment for the Triage Task
Pairwise Consistency Between Runs
Results
Q1 Results
Q2 Results: Concordance with in-context alignment
Q3(i) Results: Generalization from groups
Q3(ii) Results: Generalization from implicit groups aligned.
Q3(iii) Results: Generalization: Single Attribute Dominance (QALYs)
...and 11 more sections

Figures (6)

Figure 1: Heatmap for 3 triage decisions by 3 LLMs and 1 human. The Gemini Enhanced and GPT4o models have the highest non-concordance with each other (i.e. $\kappa < 0$, colored blue), but also the highest concordance (see the deep red off-diagonal squares).
Figure 2: Heatmap for 3 aligned triage decisions by 3 LLMs and 1 human. Claude Sonnet 3.5 was most discordant with itself (run 1 vs run 3), followed by Gemini Enhanced. GPT4o was the only model to have only positive $\kappa$ scores with itself.
Figure 3: Heatmap for 3 aligned triage decisions by 3 LLMs and 1 human. The concordance of all runs is positive ($\kappa$ of at least 0.2), unlike in Q1 and Q2. Claude Sonnet is the most concordant with the human expert and also the most internally consistent across its 3 runs.
Figure 4: Heatmap for 3 aligned triage decisions by 3 LLMs and 1 human. These are the same 20 pairs tested in Q3(i), but each LLM was first given 81 pairs to quantify inequalities in order to align it with the expert. Compared with the prior concordance heatmap, GPT4o shows increased concordance (and internal consistency), and Claude Sonnet 3.5 remains at the maximum concordance. The lower concordances with Gemini Enhanced do not show improvement overall.
Figure 5: Concordance of triage decisions aligned to maximize QALY.
...and 1 more figures
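
The heatmaps above report a $\kappa$ concordance score for every pair of decision runs. As a minimal sketch, assuming $\kappa$ is Cohen's kappa (the captions do not name the statistic), the pairwise scores behind such a heatmap could be computed as follows. The function names and the example decision lists are hypothetical and are not taken from the paper's data or code.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical decisions."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both runs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of cases where the two runs chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each run's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        # Both runs used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappa(runs):
    """Map each unordered pair of run names to its kappa score."""
    return {
        (p, q): cohen_kappa(runs[p], runs[q])
        for p, q in combinations(runs, 2)
    }


# Hypothetical triage choices ("A" or "B" = which patient of a pair is seen first).
example_runs = {
    "expert": ["A", "A", "B", "B"],
    "model_run_1": ["A", "A", "B", "B"],
    "model_run_2": ["B", "B", "A", "A"],
}
example_scores = pairwise_kappa(example_runs)
```

Under this definition, a negative score means two runs agree less often than chance would predict, which is how the blue cells in Figure 1 would be read.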
