
Self-Consistency Boosts Calibration for Math Reasoning

Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Lifeng Jin, Haitao Mi, Jinsong Su, Dong Yu

TL;DR

Three off-the-shelf calibration methods based on self-consistency align model confidence with accuracy on math reasoning tasks better than existing methods based on p(True) or logit (Kadavath et al., 2022).

Abstract

Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. In evaluations on two popular benchmarks (GSM8K and MathQA) with strong open-source LLMs (Mistral and LLaMA2), our methods better bridge model confidence and accuracy than existing methods based on p(True) or logit (Kadavath et al., 2022).
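
To make the idea of a self-consistency-based confidence concrete, the following is a minimal sketch, not necessarily any of the paper's three methods. It assumes the final answers from N sampled reasoning paths have already been extracted as strings, and it takes the share of samples that agree with the majority answer as the confidence. The function name `self_consistency_confidence` and the sample values are hypothetical.

```python
from collections import Counter


def self_consistency_confidence(answers):
    """Return (majority_answer, confidence) for a list of sampled final answers.

    Assumption: confidence is the fraction of samples that agree with the
    most frequent answer. This is an illustrative agreement score only.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / len(answers)


# Hypothetical final answers extracted from five sampled reasoning paths.
samples = ["18", "18", "20", "18", "17"]
answer, confidence = self_consistency_confidence(samples)
print(answer, confidence)
```

A well-calibrated score of this kind would be high mostly when the majority answer is correct. Measuring that requires comparing confidence with accuracy over a whole benchmark, which this sketch does not do.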

Paper Structure (17 sections, 7 equations, 3 figures, 1 table)


Figures (3)

  • Figure 1: Comparison of several calibration methods on Mistral-7B, where SC w/ $\mathcal{F}_{CN}$ is one of our methods based on self-consistency, which is introduced in the method section.
  • Figure 2: Calibration results on GSM8K when using Mixtral-8$\times$7B-Inst with different $N$.
  • Figure 3: Performance and calibration results on GSM8K for the following models, sorted by their performance: ① LLaMA2-7B-Chat, ② LLaMA2-13B-Chat, ③ Mistral-7B-Inst, ④ LLaMA2-70B-Chat, ⑤ Mixtral-8$\times$7B-Inst.