Rethinking STS and NLI in Large Language Models

Yuxia Wang, Minghan Wang, Preslav Nakov

TL;DR

The paper investigates STS and NLI in the era of prompting-based LLMs, with a focus on clinical/biomedical domains. It evaluates zero-shot and few-shot prompting (including chain-of-thought) for general and domain-specific datasets, comparing against fine-tuned BERT-based baselines and open LLMs. Key findings show that fine-tuned BERT-base generally outperforms zero-shot ChatGPT, while LLMs exhibit persistent calibration gaps, especially for continuous STS outputs. The study also probes whether persona-based prompts can capture collective human opinions, concluding that current LLMs cannot reliably imitate annotator distributions. Overall, the work highlights enduring challenges in domain accuracy, uncertainty estimation, and distribution-aware evaluation, and outlines directions like prompt optimization and broader model/dataset coverage to advance STS/NLI in medical NLP.

Abstract

Recent years have seen the rise of large language models (LLMs), where practitioners use task-specific prompts; this was shown to be effective for a variety of tasks. However, when applied to semantic textual similarity (STS) and natural language inference (NLI), the effectiveness of LLMs turns out to be limited by low-resource domain accuracy, model overconfidence, and difficulty in capturing the disagreements between human judgements. With this in mind, here we try to rethink STS and NLI in the era of LLMs. We first evaluate the performance of STS and NLI in the clinical/biomedical domain, and then we assess LLMs' predictive confidence and their capability of capturing collective human opinions. We find that these old problems are still to be properly addressed in the era of LLMs.

Paper Structure

This paper contains 54 sections, 1 equation, 5 figures, 14 tables.

Figures (5)

  • Figure 1: USTS-C ($\mu$, $\sigma$) distribution of annotators versus ChatGPT roles and ten runs by the role of linguistic expert, and KL-Divergence (bottom) between the collective human opinions and the distribution of predictions by ten different roles using ChatGPT.
  • Figure 2: ChaosNLI KL-Divergence (top) and Jensen–Shannon distance (bottom) between the collective human opinions and the distribution with bootstrap under predictions by ten different roles using ChatGPT. KL highly correlates with JSD ($r \ge 0.88$ and $\rho \ge 0.97$).
  • Figure 3: Similarity score distribution of STS-B (top) and N2C2-STS (bottom) by LLaMA-2 (7B). Ref = gold labels.
  • Figure 4: The impact of system role on the performance of domain datasets using ChatGPT.
  • Figure 5: Five ChaosNLI examples. The JSD between the annotator distribution and the ChatGPT distribution takes the values 0.2, 0.4, 0.6, 0.7, and 0.9 across the five examples.
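
Figures 1, 2 and 5 compare the distribution of human annotations with the distribution of ChatGPT predictions using KL-Divergence and Jensen–Shannon distance. The following is a minimal sketch of those two measures for a single NLI example, not the paper's implementation. It assumes that both distributions are built from vote counts over the three NLI labels, that KL is smoothed with a small epsilon, and that the Jensen–Shannon distance is normalized to lie in [0, 1]; the function names and the counts are hypothetical.

```python
import math


def normalize(counts):
    """Turn non-negative vote counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats. Zero entries of q are floored at eps (an assumption)."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence in bits, in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js_div_nats = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Dividing by ln(2) converts nats to bits; max() guards against rounding below 0.
    return math.sqrt(max(js_div_nats, 0.0) / math.log(2))


# Made-up counts over (entailment, neutral, contradiction), for illustration only.
human = normalize([60, 30, 10])  # hypothetical annotator votes
model = normalize([9, 1, 0])     # hypothetical votes from ten prompted roles

print("KL(human || model):", kl_divergence(human, model))
print("JS distance:", js_distance(human, model))
```

The Jensen–Shannon distance is symmetric and bounded, which makes it easier to compare across examples than KL, whose value grows without bound when the model assigns near-zero mass to a label that annotators chose.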