Rethinking STS and NLI in Large Language Models
Yuxia Wang, Minghan Wang, Preslav Nakov
TL;DR
The paper investigates STS and NLI in the era of prompting-based LLMs, with a focus on clinical/biomedical domains. It evaluates zero-shot and few-shot prompting (including chain-of-thought) on general and domain-specific datasets, comparing against fine-tuned BERT-based baselines and open LLMs. Key findings show that fine-tuned BERT-base generally outperforms zero-shot ChatGPT, while LLMs exhibit persistent calibration gaps, especially for continuous STS outputs. The study also probes whether persona-based prompts can capture collective human opinions, concluding that current LLMs cannot reliably imitate annotator distributions. Overall, the work highlights enduring challenges in domain accuracy, uncertainty estimation, and distribution-aware evaluation, and outlines directions such as prompt optimization and broader model/dataset coverage to advance STS/NLI in medical NLP.
Abstract
Recent years have seen the rise of large language models (LLMs), which practitioners apply through task-specific prompts; this approach has been shown to be effective for a variety of tasks. However, when applied to semantic textual similarity (STS) and natural language inference (NLI), the effectiveness of LLMs turns out to be limited by low accuracy in low-resource domains, model overconfidence, and difficulty capturing the disagreements between human judgements. With this in mind, here we try to rethink STS and NLI in the era of LLMs. We first evaluate STS and NLI performance in the clinical/biomedical domain, and then we assess LLMs' predictive confidence and their capability of capturing collective human opinions. We find that these old problems are still to be properly addressed in the era of LLMs.
