Amulet: Putting Complex Multi-Turn Conversations on the Stand with LLM Juries

Sahana Ramnath, Anurag Mudgil, Brihi Joshi, Skyler Hallinan, Xiang Ren

TL;DR

Amulet addresses the challenge of evaluating LLM judges on complex, multi-turn human–AI conversations by grounding judgments in dialog acts and Gricean maxims. It delivers two prompts (Amulet-DA and Amulet-Maxim) and two jury designs (Amulet-LM-Jury and Amulet-RM-Jury) that jointly improve judgment accuracy across four challenging datasets. The work demonstrates that humans frequently shift intents across turns and that DA and Maxim signals distinguish preferred responses in a majority of cases, with certain maxims consistently informative. Overall, Amulet provides a lightweight, adaptable augmentation to strong judges and reward models, though it also highlights practical limits such as data contamination, bias, and computational costs.

Abstract

Today, large language models are widely used as judges to evaluate responses from other language models. Hence, it is imperative to benchmark and improve these LLM-judges on real-world language model usage: a typical human-assistant conversation is lengthy and shows significant diversity in topics, intents, and requirements across turns, e.g., social interactions, task requests, feedback. We present Amulet, a framework that leverages the linguistic concepts of dialog-acts and maxims to improve the accuracy of LLM-judges on preference data with complex, multi-turn conversational context. Amulet presents valuable insights about (a) the communicative structures and intents present in the conversation (dialog acts), and (b) the satisfaction of conversational principles (maxims) by the preference responses, and uses them to make judgments. On four challenging datasets, Amulet shows that (a) humans frequently (60 to 70 percent of the time) change their intents from one turn of the conversation to the next, and (b) in 75 percent of instances, the preference responses can be differentiated via dialog acts and/or maxims, underscoring the significance of these concepts in judging such data. Amulet can be used either as a judge by applying the framework to a single LLM, or integrated into a jury with different LLM judges; our judges and juries show strong improvements over relevant baselines on all four datasets.

Paper Structure

This paper contains 75 sections, 7 equations, 8 figures, 20 tables.

Figures (8)

  • Figure 1: Real-world language model usage typically includes lengthy, complex human-assistant conversations, where humans express varying intents and requirements across the turns. Given preference data with such context, how accurate are LLM-judges in predicting which response is better? We develop a framework, Amulet, that uses the following linguistic concepts for this purpose: (a) dialog-acts (DA) to analyze the communicative structure of each turn in the conversation, and (b) maxims to compare the preference responses in terms of principles such as informativity, truth, relevance, etc.
  • Figure 2: Amulet-DA uses dialog-acts to analyze communicative structures in the conversation. In the example above, the second human turn "Okay thanks. And what are some typical forms required?" has the structures of feedback/positive, social/thanking and task/question. Amulet-Maxim analyzes which conversational principles are satisfied by the preference responses. In the example above, Response-1 is better than Response-2 at most of the maxims. Amulet uses these annotations to give more accurate preference judgments.
  • Figure 3: Graphs for the conversation analysis section (sec:analyze-conv): (a) frequency of the most common functions in WildFeedback, (b) distribution of the number of conversations in WildFeedback whose human turns have $\#\text{DAs} \geq x$, (c) % of consecutive human turns with different DAs, (d) % of consecutive assistant turns with different DAs when the corresponding consecutive human DAs are different. Acronyms (e.g., PQ is Propositional Question, SQ is Set Question) are listed in the appendix (app:remaining-analysis-conv-figs).
  • Figure 4: For each maxim on the x-axis, we measure the % of conversations in the dataset where (a) the chosen response satisfies the maxim better than the rejected, (b) the rejected response satisfies the maxim better, (c) both responses satisfy the maxim equally, (d) neither response satisfies the maxim. Maxims such as Quantity-1 and Relevance-1 are important in all datasets for distinguishing the chosen response from the rejected one, whereas maxims such as Benevolence-2 are significant only in certain datasets, e.g., Nectar.
  • Figure 5: Voting pipeline in Amulet-LM-Jury and Amulet-RM-Jury.
  • ...and 3 more figures
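
The statistics described in the captions of Figure 3(c) and Figure 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the numeric per-maxim scores, the convention that a score of 0 means "not satisfied", and the function names `maxim_outcome`, `maxim_distribution`, and `shift_rate` are all assumptions made for illustration. The dialog-act labels in the usage example are taken from the Figure 2 caption.

```python
from collections import Counter

# Hypothetical names for the four outcomes listed in the Figure 4 caption.
OUTCOMES = ("chosen_better", "rejected_better", "both_equal", "neither")


def maxim_outcome(chosen_score, rejected_score):
    """Classify one conversation for one maxim.

    Assumption of this sketch: scores are non-negative numbers and a
    score of 0 means the response does not satisfy the maxim at all.
    """
    if chosen_score == 0 and rejected_score == 0:
        return "neither"
    if chosen_score > rejected_score:
        return "chosen_better"
    if rejected_score > chosen_score:
        return "rejected_better"
    return "both_equal"


def maxim_distribution(pairs):
    """Return the % of conversations that fall into each outcome.

    `pairs` is a list of (chosen_score, rejected_score) tuples,
    one tuple per conversation, all for the same maxim.
    """
    total = len(pairs)
    if total == 0:
        return {outcome: 0.0 for outcome in OUTCOMES}
    counts = Counter(maxim_outcome(c, r) for c, r in pairs)
    return {outcome: 100.0 * counts[outcome] / total for outcome in OUTCOMES}


def shift_rate(turn_das):
    """Return the % of consecutive turn pairs whose dialog-act sets differ.

    `turn_das` is a list with one collection of DA labels per turn,
    in conversation order.
    """
    pairs = list(zip(turn_das, turn_das[1:]))
    if not pairs:
        return 0.0
    changed = sum(1 for prev, curr in pairs if set(prev) != set(curr))
    return 100.0 * changed / len(pairs)


# Usage with made-up data: four conversations scored on one maxim,
# and one conversation with four human turns.
example_scores = [(2, 1), (1, 2), (1, 1), (0, 0)]
example_turns = [
    ["task/question"],
    ["feedback/positive", "social/thanking", "task/question"],
    ["task/question"],
    ["task/question"],
]
distribution = maxim_distribution(example_scores)
human_shift = shift_rate(example_turns)
```

Comparing dialog acts as sets means a turn counts as a shift whenever any act is added or dropped, which is one plausible reading of "different DAs"; a stricter or looser definition would change the reported percentage.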