Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment
Lu Chen, Yuxuan Huang, Yixing Li, Dongrui Liu, Qihan Ren, Shuai Zhao, Kun Kuang, Zilong Zheng, Quanshi Zhang
TL;DR
The paper addresses the risk that LLMs can produce correct judgments while relying on flawed inference patterns. It introduces an AND-OR interaction framework with universal matching and sparsity guarantees to quantify reliable versus unreliable patterns behind judgments, applied to legal LLMs. Through expert-annotated phrase categories and extensive experiments, it reveals that a substantial portion of salient interactions are unreliable, with low-order local cues and biased or incorrect factors contributing to decisions. The findings highlight important safety and fairness implications for deploying LLMs in high-stakes tasks and motivate extending interaction-based evaluation beyond language quality toward trustworthiness and interpretability.
Abstract
This paper presents a method to analyze the inference patterns used by Large Language Models (LLMs) for judgment, in a case study on legal LLMs, in order to identify potentially incorrect representations in the LLM as judged against human domain knowledge. Unlike traditional evaluations of language generation results, we propose to evaluate the correctness of the detailed inference patterns that lie behind an LLM's seemingly correct outputs. To this end, we quantify the interactions between input phrases used by the LLM as primitive inference patterns, because recent theoretical work has proven several mathematical guarantees for the faithfulness of interaction-based explanations. We design a set of metrics to evaluate the detailed inference patterns of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for legal judgment may represent misleading or irrelevant logic.
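To make the notion of an interaction between input phrases concrete, the sketch below computes AND interactions in the standard Harsanyi form, I(S) = sum over T ⊆ S of (-1)^(|S|-|T|) v(T), where v(T) is the model's output when only the phrases in T are kept. This summary does not spell out the formula, so the Harsanyi form is assumed here, and `toy_score`, `and_interactions`, and the three-phrase setup are hypothetical stand-ins for illustration rather than the authors' implementation. On this toy example the sketch also checks the matching property (summing the interactions inside S recovers v(S)) and shows that only a few interactions are non-zero.

```python
from itertools import chain, combinations


def subsets(items):
    """Return an iterator over every subset of `items`, as tuples."""
    items = tuple(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def and_interactions(v, n):
    """Harsanyi (AND) interaction for every subset S of n input phrases.

    v maps a frozenset of kept phrase indices to a scalar model output.
    Assumed form: I(S) = sum_{T subset of S} (-1)^(|S|-|T|) * v(T).
    """
    effects = {}
    for S in subsets(range(n)):
        effects[frozenset(S)] = sum(
            (-1) ** (len(S) - len(T)) * v(frozenset(T)) for T in subsets(S)
        )
    return effects


def toy_score(present):
    """Hypothetical stand-in for an LLM's confidence in a judgment.

    Phrases 0 and 1 only matter jointly; phrase 2 acts alone.
    """
    score = 0.0
    if {0, 1} <= present:
        score += 2.0
    if 2 in present:
        score -= 0.5
    return score


effects = and_interactions(toy_score, 3)

# Keep only the interactions with a non-negligible effect.
salient = {tuple(sorted(S)): w for S, w in effects.items() if abs(w) > 1e-9}

# Matching check: v(S) equals the sum of all interactions contained in S.
matches = all(
    abs(toy_score(frozenset(S))
        - sum(effects[frozenset(T)] for T in subsets(S))) < 1e-9
    for S in subsets(range(3))
)
```

Here `salient` retains two of the eight possible interactions, the pair of phrases 0 and 1 and the single phrase 2. The exhaustive enumeration costs 2^n model evaluations, so a sketch like this is only practical for a small number of phrases.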
