
T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You, Marco Gatti, Bhuvnesh Jain, Helen Qu, Daniel A Hashimoto, Amin Madani, Rajat Deo, Sameed Ahmed M. Khatana, Gary E. Weissman, Lyle Ungar, Eric Wong

TL;DR

The paper reframes evaluation of LLM explanations by introducing expert alignment as a third orthogonal criterion besides plausibility and faithfulness, and presents T-FIX, a seven-domain benchmark co-developed with domain experts. It introduces a three-stage pipeline (atomic claim extraction, relevancy filtering, and alignment scoring) that converts free-form explanations into claim-level judgments aligned with domain criteria, with final aggregation yielding an expert-alignment score. Validation via annotation studies and domain expert interviews demonstrates reliable alignment signals but also reveals gaps, especially in biomedical domains where multi-criterion reasoning is essential. Across seven diverse domains and multiple evaluators, current models show limited ability to consistently produce expert-aligned explanations, underscoring a critical direction for model training and prompting strategies. The work provides a practical, extensible framework for evaluating expert-aligned explanations and outlines concrete paths for improving domain-specific epistemic validity in high-stakes settings.
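
The final aggregation step can be illustrated with a minimal sketch. It assumes that claims have already been extracted, filtered, and labeled (in the paper these stages are LLM-driven). Following the Figure 3 caption, filtered-out claims count as "none" and the explanation score is the average over all claim scores. The `Claim` type, the `LABEL_TO_SCORE` mapping, the intermediate "partial" label, and the numeric values are hypothetical choices made for illustration; the source names only "complete" and "none".

```python
from dataclasses import dataclass
from typing import List

# Assumed numeric mapping for alignment labels. Only "complete" and "none"
# are named in the source; "partial" and all numeric values are illustrative.
LABEL_TO_SCORE = {"complete": 1.0, "partial": 0.5, "none": 0.0}


@dataclass
class Claim:
    text: str
    relevant: bool       # outcome of the relevancy filter
    label: str = "none"  # alignment label; ignored when the claim is filtered out


def expert_alignment_score(claims: List[Claim]) -> float:
    """Average the claim scores; filtered-out claims are scored as "none"."""
    if not claims:
        return 0.0  # assumption: an explanation with no claims scores zero
    scores = [
        LABEL_TO_SCORE[c.label] if c.relevant else LABEL_TO_SCORE["none"]
        for c in claims
    ]
    return sum(scores) / len(scores)


# Hypothetical claims for a sepsis-risk explanation.
example_claims = [
    Claim("Serum lactate is elevated.", relevant=True, label="complete"),
    Claim("The patient may be anxious.", relevant=False),
    Claim("Heart rate has trended upward.", relevant=True, label="partial"),
]
print(expert_alignment_score(example_claims))
```

Because filtered-out claims stay in the denominator, an explanation padded with speculative statements scores lower than a concise one containing the same aligned claims.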

Abstract

As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.

Paper Structure

This paper contains 68 sections, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Current evaluations of LLM explanations typically consider two dimensions: plausibility (whether the reasoning is logically coherent) and faithfulness (whether it reflects the model’s true decision process). We introduce a third, orthogonal dimension: expert alignment (whether the LLM reasons as a domain expert would). For instance, an LLM may correctly predict sepsis risk with a plausible and faithful explanation, yet because it relies on features clinicians rarely use, its expert alignment is low.
  • Figure 2: An overview of the T-FIX construction process. For each dataset, we first establish expert alignment criteria (features deemed important by domain experts for a specific task) through collaboration with these experts and LLM-based deep research tools. These criteria form the basis of the T-FIX evaluation pipeline, which processes an LLM-generated explanation to output an expert alignment score. A high score suggests the explanation reflects reasoning aligned with domain experts (i.e., the LLM "thinks like an expert"), while a low score indicates the explanation may rely on aspects that experts would deem irrelevant.
  • Figure 3: Our T-FIX pipeline. To evaluate an LLM-generated explanation, we first decompose it into atomic claims. Next, we filter out irrelevant claims, such as unsupported or speculative statements. Each remaining claim is then scored against the domain-specific expert alignment criteria: a score of "complete" indicates perfect overlap with at least one criterion, while "none" indicates no overlap. Filtered-out claims are automatically assigned a score of "none". We compute the final expert-alignment score for the explanation by averaging across all claim scores.
  • Figure 4: Overview of datasets and domains in T-FIX. We evaluate LLM expert alignment across seven diverse domains, spanning cosmology, psychology, and medicine. For each dataset, we highlight the motivating task, input–output format, representative example, and the expert responsible for validating alignment criteria. The final row summarizes the expert alignment criteria used for scoring explanations in each domain. The column colors reflect dataset modality: blue indicates vision, yellow indicates language, and pink indicates time-series.
  • Figure 5: Shannon entropy of expert alignment criteria for GPT-4o. For each prompting baseline, we show coverage of each domain's explanations across all expert criteria: a high value indicates the LLM considers many criteria across examples, while a low value indicates the LLM focuses on the same criteria repeatedly.
  • ...and 13 more figures
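
The coverage measure in the Figure 5 caption can be sketched as a Shannon entropy over how often each expert criterion is matched. This is a rough illustration, not the paper's implementation: the function name `criteria_entropy`, the input format (one criterion identifier per aligned claim), the criterion names, and the choice of base-2 logarithm are all assumptions.

```python
import math
from collections import Counter
from typing import Iterable


def criteria_entropy(criterion_hits: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over criteria.

    `criterion_hits` holds one criterion identifier per aligned claim,
    pooled across all explanations in a domain (assumed input format).
    """
    counts = Counter(criterion_hits)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: no aligned claims means zero coverage
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical criterion identifiers for illustration only.
focused = ["lactate", "lactate", "lactate", "lactate"]
spread = ["lactate", "heart_rate", "temperature", "wbc_count"]

print(criteria_entropy(focused))
print(criteria_entropy(spread))
```

A model that keeps citing the same criterion yields an entropy near zero, while one that spreads its claims evenly over k criteria approaches log2(k), which matches the caption's reading of high versus low values.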