DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following

Nardine Basta, Dali Kaafar

TL;DR

DIALEVAL is a type-theoretic framework that uses dual LLM agents to automate instruction decomposition into typed predicates and to implement type-specific satisfaction semantics, enabling evaluation in conversational contexts where single-turn methods fail.

Abstract

Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing satisfaction, tasks that currently depend on manual annotation and uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type-theoretic framework using dual LLM agents to automate instruction decomposition into typed predicates and implement type-specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria (semantic equivalence for content predicates, exact precision for numerical predicates), mirroring empirically observed human assessment patterns. Extended to multi-turn dialogues through history-aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single-turn methods fail. Validation demonstrates 90.38% accuracy (26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.
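
The idea of typed predicates with type-specific satisfaction criteria can be illustrated with a minimal Python sketch. This is not the authors' implementation: in DIALEVAL both decomposition and evaluation are performed by LLM agents, whereas here the predicates are written by hand and the checks are simple stand-in functions. The names `Predicate`, `word_count_below`, `mentions_any`, and `uifs` are hypothetical, the keyword match is only a crude proxy for a semantic-equivalence judgment, and the score is assumed to be the fraction of satisfied predicates, which the text above does not state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class PredicateType(Enum):
    CONTENT = "content"      # judged by semantic equivalence
    NUMERICAL = "numerical"  # judged by exact precision


@dataclass(frozen=True)
class Predicate:
    ptype: PredicateType
    description: str
    # Hypothetical stand-in for the Evaluation Agent's binary judgment.
    check: Callable[[str], bool]


def word_count_below(limit: int) -> Callable[[str], bool]:
    """Numerical check: strict comparison with no tolerance."""
    return lambda response: len(response.split()) < limit


def mentions_any(phrases: List[str]) -> Callable[[str], bool]:
    """Content check: keyword proxy for an LLM's semantic judgment."""
    return lambda response: any(p in response.lower() for p in phrases)


def evaluate(predicates: List[Predicate], response: str) -> List[bool]:
    """Return one binary judgment per predicate."""
    return [p.check(response) for p in predicates]


def uifs(judgments: List[bool]) -> float:
    """ASSUMPTION: score = fraction of satisfied predicates."""
    return sum(judgments) / len(judgments) if judgments else 0.0


# Two predicates modeled on instructions listed in the Figure 5 caption.
predicates = [
    Predicate(PredicateType.CONTENT,
              "Asks about caller identity",
              mentions_any(["who is this", "who am i speaking"])),
    Predicate(PredicateType.NUMERICAL,
              "Response length < 30 words",
              word_count_below(30)),
]

response = "Hello? Who is this, please?"
judgments = evaluate(predicates, response)
score = uifs(judgments)
print(judgments, score)
```

Keeping the predicate type on each record is what lets an evaluator choose a different criterion per predicate: a 30-word response fails the numerical check outright, while a content check can accept any phrasing that conveys the required meaning.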

Paper Structure

This paper contains 13 sections, 10 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: DIALEVAL dual-agent architecture. The Instruction Analysis Agent decomposes instructions into typed predicates. The Evaluation Agent performs type-specific satisfaction evaluation with the response, producing binary judgments and computing the Utterance-level Instruction Following Score (UIFS).
  • Figure 2: Core DIALEVAL prompts for instruction analysis (a) and evaluation (b).
  • Figure 3: DIALEVAL dialogue-specific extensions incorporating dialogue history for instruction analysis (a) and evaluation (b).
  • Figure 4: DIALEVAL validation against human evaluation.
  • Figure 5: Per-instruction performance. Instruction IDs: (1) Initiate conversation by asking about caller identity, (2) Response length $<$ 30 words, (3) Express hesitation for sensitive information, (4) Provide plausible fake information, (5) Maintain conversation flow, (6) Maintain naive persona, (7) Build upon previous dialogue context.