Task Calibration: Calibrating Large Language Models on Inference Tasks

Yingjie Li, Yun Luo, Xiaotian Xie, Yue Zhang

Abstract

Large language models (LLMs) have exhibited impressive zero-shot performance on inference tasks. However, LLMs may suffer from spurious correlations between input texts and output labels, which limit their ability to reason based purely on general language understanding. In other words, LLMs may make predictions primarily based on the premise or the hypothesis alone, rather than on both components. To address this problem, which may lead to unexpected performance degradation, we propose task calibration (TC), a zero-shot, inference-only calibration method inspired by mutual information that recovers LLM performance through task reformulation. TC encourages LLMs to reason based on both the premise and the hypothesis, while mitigating the models' over-reliance on either one individually. Experimental results show that TC achieves a substantial improvement on 13 inference tasks in the zero-shot setup. We further validate the effectiveness of TC in few-shot setups and on various natural language understanding tasks. Further analysis indicates that TC is also robust to prompt templates and has the potential to be integrated with other calibration methods.
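
To make the idea concrete, the following is a minimal sketch of a mutual-information-style calibration, not the paper's exact formulation, which this page does not reproduce. It assumes label probabilities are already available from three prompts (both components, premise only, hypothesis only) and subtracts the averaged single-component log-probabilities from the joint log-probability. The function names, the 0.5 weighting, and all probability values are hypothetical and chosen only for illustration.

```python
import math


def calibrate(p_both, p_premise, p_hypothesis, eps=1e-12):
    """Return PMI-style calibrated scores per label (illustrative assumption).

    score(y) = log p(y | premise, hypothesis)
               - 0.5 * (log p(y | premise) + log p(y | hypothesis))
    """
    scores = {}
    for label, p in p_both.items():
        # Average log-probability the model assigns from a single component.
        single = 0.5 * (
            math.log(p_premise[label] + eps)
            + math.log(p_hypothesis[label] + eps)
        )
        # Labels favoured by one component alone are penalised.
        scores[label] = math.log(p + eps) - single
    return scores


def predict(scores):
    """Pick the label with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical label probabilities, not taken from the paper.
p_both = {"entailment": 0.45, "not_entailment": 0.55}
p_premise = {"entailment": 0.30, "not_entailment": 0.70}
p_hypothesis = {"entailment": 0.20, "not_entailment": 0.80}

uncalibrated = predict(p_both)
calibrated = predict(calibrate(p_both, p_premise, p_hypothesis))
print(uncalibrated, calibrated)
```

In this toy setting the single-component prompts already lean strongly towards not_entailment, so the calibrated score discounts that label and the prediction changes once both components are required to contribute.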

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: An example from the QNLI dataset. Sentence-Only, Question-Only and Both indicate inputs with only the sentence, only the question, and both components, respectively. While the initial model prediction is incorrect, potentially due to the influence of the hypothesis, we observe that task calibration finally leads to a correct prediction.
  • Figure 2: The percentage of LLM predictions of the label not_entailment (NLI) with premise-only and hypothesis-only inputs. A higher value indicates low bias.
  • Figure 3: The percentage of erroneous LLM predictions that align with the labels derived from premise-only or hypothesis-only inputs. A higher value indicates high correlation.
  • Figure 4: The few-shot performance of Mistral-7B-Instruct-v0.3 using various calibration methods over the number of in-context learning (ICL) shots. Lines and shades denote the mean and standard deviation, respectively, for 5 randomly sampled sets used for few-shot inference.
  • Figure 5: The means and standard deviations over the five different templates considered for the CB, RTE, PAWS and VAST datasets. '*' indicates a significant improvement in performance over the original LLM (paired t-test with $p \leq 0.05$).