
ElicitationGPT: Text Elicitation Mechanisms via Language Models

Yifan Wu, Jason Hartline

TL;DR

This work develops ElicitationGPT, a textual information elicitation framework that reduces open-ended text to a numerical forecast using domain-knowledge-free LLM queries (summarization and QA) and proper scoring rules. The authors prove conditions for (approximate) properness and demonstrate adversarial robustness, then apply the approach to peer-grading data, where the resulting scores align well with instructor scores and with overall student performance. By treating LLMs as oracle-based components within an algorithmic AI paradigm, the paper provides guarantees beyond direct prompting and shows that text-based scoring can outperform traditional numeric rubrics in capturing true performance. The methodology offers a scalable avenue, with provable guarantees, for high-quality textual data elicitation, with potential applications beyond education.

Abstract

Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information. This paper develops mechanisms for scoring elicited text against ground truth text by reducing the textual information elicitation problem to a forecast elicitation problem, via domain-knowledge-free queries to a large language model (specifically ChatGPT), and empirically evaluates their alignment with human preferences. Our theoretical analysis shows that the reduction achieves provable properness via black-box language models. The empirical evaluation is conducted on peer reviews from a peer-grading dataset, in comparison to manual instructor scores for the peer reviews. Our results suggest a paradigm of algorithmic artificial intelligence that may be useful for developing artificial intelligence technologies with provable guarantees.

Paper Structure

This paper contains 61 sections, 17 theorems, 25 equations, 1 figure, 7 tables.

Key Result

Theorem 1

$\text{Elicitation}^{\text{GPT}}$ with perfect language oracles is proper.
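Properness means that an agent maximizes their expected score by reporting their true belief. As a minimal sketch of that property (not of ElicitationGPT itself), the following uses the standard quadratic scoring rule for a binary state. The function names and the grid search are illustrative assumptions and do not come from the paper.

```python
def quadratic_score(report: float, state: int) -> float:
    """Quadratic scoring rule for a binary state: 1 - (report - state)^2."""
    return 1.0 - (report - state) ** 2


def expected_score(report: float, belief: float) -> float:
    """Expected score when the agent believes the state is 1 with probability `belief`."""
    return (belief * quadratic_score(report, 1)
            + (1.0 - belief) * quadratic_score(report, 0))


def best_report(belief: float, grid_size: int = 101) -> float:
    """Grid-search the report in [0, 1] that maximizes the expected score."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda r: expected_score(r, belief))


if __name__ == "__main__":
    for belief in (0.1, 0.3, 0.8):
        # Under a proper rule, the maximizing report coincides with the belief.
        print(belief, best_report(belief))
```

The expected score is strictly concave in the report with its maximum at the belief, so any misreport strictly lowers the agent's expected payoff.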

Figures (1)

  • Figure 1: The V-shaped scoring rule, the optimal single-dimensional scoring rule from LHSW-22. Once the report $r$ is fixed, the score is linear in the state $\theta$. The scoring rule offers the agent a choice between two linear score functions. When $r\leq \mu_p$, the agent selects the line from $S(0; 0)$ to $S(0; 1)$. Otherwise, the agent selects the line from $S(1; 0)$ to $S(1; 1)$.
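The two-line structure described in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: the intercept `base` and the slope `slope` are hypothetical constants chosen for readability, not the normalization used in the paper or in LHSW-22, and the function names are assumptions.

```python
def v_shaped_score(report: float, state: float, prior_mean: float,
                   base: float = 0.5, slope: float = 0.5) -> float:
    """Two-line scoring rule: the report only selects which line scores the state.

    `base` and `slope` are illustrative constants (assumptions, not the paper's).
    """
    # Reports at or below the prior mean pick the decreasing line,
    # reports above it pick the increasing line.
    direction = -1.0 if report <= prior_mean else 1.0
    return base + direction * slope * (state - prior_mean)


def expected_score(report: float, belief_mean: float, prior_mean: float) -> float:
    """Each line is linear in the state, so the expectation depends only on the belief mean."""
    return v_shaped_score(report, belief_mean, prior_mean)


if __name__ == "__main__":
    prior_mean = 0.4
    for belief_mean in (0.1, 0.4, 0.7):
        truthful = expected_score(belief_mean, belief_mean, prior_mean)
        print(belief_mean, round(truthful, 3))
```

Under these assumptions, an agent who reports their belief mean truthfully earns `base + slope * |belief_mean - prior_mean|` in expectation, which traces out the V shape as the belief mean moves away from the prior mean. Reporting on the wrong side of the prior mean selects the other line and earns weakly less.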

Theorems & Definitions (41)

  • Definition 1: Properness
  • Definition 2: Approximate Properness
  • Definition 3: Quadratic
  • Definition 4: V-shaped
  • Definition 5: Multi-dimensional Aggregation
  • Definition 6: Average Aggregation
  • Definition 7: Max-over-separate Aggregation
  • Definition 8: Proper Scoring Rules for Ternary Reports
  • Definition 9: V-shaped for Ternary Reports
  • Definition 10: $\text{Elicitation}^{\text{GPT}}$
  • ...and 31 more