Evaluating language models as risk scores

André F. Cruz; Moritz Hardt; Celestine Mendler-Dünner

TL;DR

This work introduces folktexts, an open-source framework that converts census-based tabular prediction tasks into natural-language prompts, elicits risk scores from LLMs, and evaluates their calibration. Across 17 LLMs and five ACS-based tasks, zero-shot multiple-choice prompts carry strong predictive signal but are severely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate it and produce over-confident scores. Verbalized numeric prompting substantially improves calibration for instruction-tuned models, though at a modest cost in AUC. These differences cannot be observed in traditional realizable benchmarks, which is the blind spot the work highlights. The findings underline the need to evaluate uncertainty quantification in LLMs explicitly, particularly for consequential risk scoring, and suggest that future work should extend calibration techniques and incorporate more robust uncertainty measures. The folktexts suite also enables systematic fairness auditing and exploration of how prompting style affects uncertainty representation in population statistics tasks.
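To make the tabular-to-prompt conversion concrete, here is a minimal sketch in plain Python. It is not the folktexts API: the function name `row_to_prompt`, the prompt wording, the feature names, and the example question are all hypothetical and chosen only for illustration.

```python
def row_to_prompt(row, question, choices):
    """Render one tabular record as a multiple-choice question prompt.

    Hypothetical helper for illustration; the wording is an assumption,
    not the template used by folktexts.
    """
    lines = ["The following data describes a survey respondent:"]
    # One bullet per feature, in the order given by the record.
    for name, value in row.items():
        lines.append(f"- {name}: {value}")
    lines.append("")
    lines.append(f"Question: {question}")
    # Label the answer options A, B, C, ...
    for letter, text in zip("ABCDEFGH", choices):
        lines.append(f"{letter}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


# Illustrative record; column names and values are made up.
example_row = {
    "age": 43,
    "occupation": "Registered nurse",
    "hours worked per week": 40,
}
prompt = row_to_prompt(
    example_row,
    "Is this person's yearly income above $50,000?",
    ["Yes", "No"],
)
print(prompt)
```

Ending the prompt with "Answer:" means the model's next-token distribution over the option letters can be read off directly, which is what a multiple-choice risk score relies on.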

Abstract

Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty. In this work, we focus on the use of LLMs as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products. A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks. We evaluate 17 recent LLMs across five proposed benchmark tasks. We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated. Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores. In fact, instruction-tuning polarizes the answer distribution regardless of the true underlying data uncertainty. This reveals a general inability of instruction-tuned LLMs to express data uncertainty using multiple-choice answers. A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models. These differences in the ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind spot in the current evaluation ecosystem that folktexts covers.
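One common way to turn a multiple-choice answer into a risk score is to renormalize the probabilities the model assigns to the two answer tokens. The sketch below assumes this scheme and assumes the two log-probabilities have already been obtained from the model; the function name and the example numbers are hypothetical, and the exact procedure used by folktexts may differ.

```python
import math


def risk_score_from_logprobs(logprob_positive, logprob_negative):
    """Renormalize two answer-token log-probabilities into a score in [0, 1].

    Assumed scheme: score = P(positive) / (P(positive) + P(negative)).
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logprob_positive, logprob_negative)
    p_pos = math.exp(logprob_positive - m)
    p_neg = math.exp(logprob_negative - m)
    return p_pos / (p_pos + p_neg)


# Made-up example: the model puts 0.30 on the "yes" option and 0.10 on "no",
# with the remaining mass on unrelated tokens.
score = risk_score_from_logprobs(math.log(0.30), math.log(0.10))
print(round(score, 2))
```

A model that places nearly all of its mass on a single option for every input will produce scores close to 0 or 1, which matches the polarized answer distribution the abstract reports for instruction-tuned models.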

Paper Structure (37 sections, 5 equations, 16 figures, 6 tables)

Introduction
Our contributions
Empirical insights.
Outline.
Limitations
Related work
Calibration.
Preliminaries
Calibration
Predictive performance
Instance ranking.
Evaluating language models as risk scores
Prediction tasks
Natural uncertainty in risk scoring.
Extracting risk scores
...and 22 more sections

Figures (16)

Figure 1: Information flow from tabular data to risk scores, using a supervised classifier (left) or a language model (right). The folktexts package maps language models to the traditional machine learning workflow.
Figure 2: Change in calibration error (ECE) of instruction-tuned models when using numeric risk prompting (orange circles) versus multiple-choice prompting (blue squares). Improvement/deterioration is represented by green/red arrows, respectively. An overwhelming majority of model/task pairs see calibration improvements.
Figure 3: Calibration curves for base and instruction-tuned versions of the largest models studied, on the ACSIncome task. Curves are computed using $10$ quantile-based score bins. Risk scores were generated using multiple-choice-style prompting (left plots) or numeric chat-style prompting (right plots).
Figure 4: Risk score confidence bias for all LLMs on the ACSIncome task. Negative values indicate under-confident risk scores (overestimating uncertainty), while positive values indicate over-confident risk scores (underestimating uncertainty). Instruction-tuned models are generally over-confident when using multiple-choice prompting (multiple-choice subfigure), but this bias is considerably diminished when using numeric prompting (numeric subfigure).
Figure 5: Risk score distribution for base and instruction-tuned model pairs on the ACSIncome task, using multiple-choice prompting. After instruction-tuning, models exhibit high confidence, but worse calibration in general. The XGBoost scores showcase a perfectly calibrated distribution ($\text{ECE} \approx 0.00$).
...and 11 more figures
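The captions above refer to calibration error (ECE) and to curves computed over 10 quantile-based score bins. As a rough illustration, the following sketch computes a standard binned ECE with equal-count bins in plain Python. The function name is hypothetical, and the binning details (for example, how tied scores at a bin boundary are handled) are assumptions that may not match the paper's implementation.

```python
def quantile_ece(scores, labels, n_bins=10):
    """Expected calibration error using equal-count (quantile) score bins.

    Each bin contributes |mean score - observed positive rate|,
    weighted by the fraction of samples it contains.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")
    # Sort by score so that consecutive slices form quantile bins.
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ece = 0.0
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        if hi == lo:
            continue  # fewer samples than bins: skip empty slices
        bin_pairs = pairs[lo:hi]
        mean_score = sum(s for s, _ in bin_pairs) / len(bin_pairs)
        pos_rate = sum(y for _, y in bin_pairs) / len(bin_pairs)
        ece += (len(bin_pairs) / n) * abs(mean_score - pos_rate)
    return ece


# Made-up data: scores of 0.9 on a group where only half are positive.
over_confident = quantile_ece([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0], n_bins=1)
# Made-up data: scores that match the outcomes exactly.
well_calibrated = quantile_ece([0.0] * 5 + [1.0] * 5, [0] * 5 + [1] * 5, n_bins=2)
print(over_confident, well_calibrated)
```

Quantile bins place the same number of samples in every bin, so sparsely populated score ranges do not produce noisy per-bin estimates, which is useful when the score distribution is strongly polarized.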
