Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment

Lu Chen, Yuxuan Huang, Yixing Li, Dongrui Liu, Qihan Ren, Shuai Zhao, Kun Kuang, Zilong Zheng, Quanshi Zhang

TL;DR

The paper addresses the risk that LLMs can produce correct judgments while relying on flawed inference patterns. It introduces an AND-OR interaction framework with universal matching and sparsity guarantees to quantify reliable versus unreliable patterns behind judgments, applied to legal LLMs. Through expert-annotated phrase categories and extensive experiments, it reveals that a substantial portion of salient interactions are unreliable, with low-order local cues and biased or incorrect factors contributing to decisions. The findings highlight important safety and fairness implications for deploying LLMs in high-stakes tasks and motivate extending interaction-based evaluation beyond language quality toward trustworthiness and interpretability.

Abstract

This paper presents a method to analyze the inference patterns used by Large Language Models (LLMs) for judgment in a case study on legal LLMs, so as to identify potentially incorrect representations of the LLM, according to human domain knowledge. Unlike traditional evaluations of language generation results, we propose to evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs. To this end, we quantify the interactions between input phrases used by the LLM as primitive inference patterns, because recent theoretical work has proven several mathematical guarantees of the faithfulness of the interaction-based explanation. We design a set of metrics to evaluate the detailed inference patterns of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for the legal judgment may represent misleading or irrelevant logic.

Paper Structure

This paper contains 25 sections, 1 theorem, 17 equations, 22 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

When scalar weights in the logical model are set to $\forall S\subseteq N,\ I^{\text{AND}}_S \overset{\text{def}}{=} \sum\nolimits_{T \subseteq S}(-1)^{|S|-|T|}v_{\text{and}}(\mathbf{x}_T)$ (the numerical effect of the AND interaction pattern $I^{\text{AND}}_S$ is also known as the Harsanyi dividend), where $\mathbf{x}_T$ is the masked sample; note that each input variable $i\in N\setminus T$ is masked …
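The AND weight in the statement above can be checked numerically on a toy example. The sketch below is illustrative only and is not the authors' implementation. It assumes a hypothetical score function `toy_score` standing in for $v_{\text{and}}(\mathbf{x}_T)$, with $T$ given as the set of unmasked variable indices. It computes $I^{\text{AND}}_S$ for every $S \subseteq N$ and confirms that summing the weights over all $S \subseteq T$ reproduces the score on every masked sample. Only the AND term shown in the statement is covered; the helper names `subsets`, `and_interactions` and `reconstruct` are likewise assumptions made for this example.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def and_interactions(v, n):
    """Compute I_S^AND = sum_{T subseteq S} (-1)^(|S|-|T|) * v(T) for all S subseteq N."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * v(T) for T in subsets(S))
        for S in subsets(range(n))
    }


def toy_score(T):
    """Hypothetical stand-in for v_and(x_T); T is the set of unmasked variables."""
    score = 0.5                 # score when every variable is masked
    if 0 in T:
        score += 1.0            # variable 0 alone
    if {0, 1} <= T:
        score += 2.0            # variables 0 and 1 jointly present
    if {1, 2} <= T:
        score -= 0.75           # variables 1 and 2 jointly present
    return score


interactions = and_interactions(toy_score, 3)


def reconstruct(T):
    """Sum the AND weights of every pattern S that is fully unmasked in x_T."""
    return sum(effect for S, effect in interactions.items() if S <= T)


# The summed weights match the score on all 2^3 masked samples.
for T in subsets(range(3)):
    assert abs(reconstruct(T) - toy_score(T)) < 1e-9
```

In this toy case only four of the eight weights are nonzero (the empty set, {0}, {0, 1} and {1, 2}), which gives a small-scale picture of what a sparse set of salient interactions looks like.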

Figures (22)

  • Figure 1: Correctness of the detailed inference patterns of an LLM. The AND-OR logical model $h(\cdot)$ accurately fits the output score of the LLM $v(\cdot)$ when making the judgment "Assault" for Andy, $h(\text{"Assault"} \mid \mathbf{x}) = v(\text{"Assault"} \mid \mathbf{x})$, no matter how the input legal case $\mathbf{x}$ is masked in the bottom-right figure. Blue edges connect reliable interaction effects ($R^{\text{AND}}_S$ and $R^{\text{OR}}_S$) that contribute to the output score $v(\text{"Assault"} \mid \mathbf{x})$, typically aligning with legal domain knowledge. Red edges connect unreliable interaction effects ($U^{\text{AND}}_S$ and $U^{\text{OR}}_S$) that contribute to $v(\text{"Assault"} \mid \mathbf{x})$, often reflecting problematic patterns used by the LLM for the judgment.
  • Figure 2: Ratio of reliable interaction effects (measured by $s^{\text{\rm{reliable}}}$) among all the interaction patterns used by the LLM for judgment.
  • Figure 3: Distribution of all interactions over different orders (complexities) (denoted by $A^{(o),\text{\rm{pos}}}$ and $A^{(o),\text{\rm{neg}}}$) and that of all reliable interactions (denoted by $A^{(o), \text{\rm{pos}}}_{\text{\rm{reliable}}}$ and $A^{(o), \text{\rm{neg}}}_{\text{\rm{reliable}}}$).
  • Figure 4: Visualization of judgments affected by incorrect entities' actions. (a) Irrelevant phrases were annotated in the legal case, including the time and defendant's actions that were not the direct reason for the judgment. Criminal actions of the defendant were annotated as relevant phrases. Criminal actions of the unrelated person were annotated as forbidden phrases. (b) Judgments predicted by the two legal LLMs, which were both correct according to laws of the two countries. (c,d) We quantified the reliable and unreliable interaction effects.
  • Figure 5: Visualization of judgments biased by discrimination in identity. (a) Irrelevant phrases were annotated in the legal case, including the occupation, time and actions that are not the direct reason for the judgment. Criminal actions of the defendant were annotated as relevant phrases. (b) The SaulLM-7B-Instruct model predicted the judgment based on the legal case with different occupations. (c,d) We quantified the reliable and unreliable interaction effects.
  • ...and 17 more figures

Theorems & Definitions (4)

  • Theorem 1: Universal matching property (proof in the appendix, label appx:universal)
  • Definition 1: Ratio of reliable interaction effects
  • proof
  • proof