Self-Consistency Boosts Calibration for Math Reasoning

Ante Wang; Linfeng Song; Ye Tian; Baolin Peng; Lifeng Jin; Haitao Mi; Jinsong Su; Dong Yu

TL;DR

Three off-the-shelf calibration methods based on self-consistency bridge model confidence and accuracy on math reasoning tasks better than existing methods based on p(True) or logits (Kadavath et al., 2022).

Abstract

Calibration, which establishes the correlation between accuracy and model confidence, is important for LLM development. We design three off-the-shelf calibration methods based on self-consistency (Wang et al., 2022) for math reasoning tasks. Evaluated on two popular benchmarks (GSM8K and MathQA) using strong open-source LLMs (Mistral and LLaMA2), our methods bridge model confidence and accuracy better than existing methods based on p(True) or logits (Kadavath et al., 2022).
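To make the idea concrete, here is a minimal sketch of how a confidence score could be read off a set of sampled answers. It is an illustration only, not the paper's definitions: the function name `agreement_confidence`, the exact-match clustering, and the "majority share" score are assumptions introduced here, and the sampled answers are made-up strings rather than real model outputs.

```python
from collections import Counter


def agreement_confidence(answers):
    """Summarize agreement among N sampled final answers.

    Assumption: answers are clustered by exact string match after
    stripping whitespace. A real system might normalize numbers first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    # Each distinct answer forms one cluster; the count is its size.
    clusters = Counter(a.strip() for a in answers)
    top_answer, top_size = clusters.most_common(1)[0]

    return {
        "answer": top_answer,                      # majority-vote answer
        "num_clusters": len(clusters),             # fewer clusters = more agreement
        "majority_share": top_size / len(answers), # fraction agreeing with the vote
    }


# Hypothetical samples for one question (not real model outputs).
samples = ["18", "18", "20", "18 "]
result = agreement_confidence(samples)
print(result)
```

Both quantities echo the outline's "Cluster Number" and "Cluster Size" headings in spirit: many distinct answers, or a small majority cluster, suggest low confidence. How the paper turns them into scores is not shown in this excerpt.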

Paper Structure (17 sections, 7 equations, 3 figures, 1 table)

Introduction
Preview: Self-Consistency with CoT Prompting
Calibration using Self-Consistency
Cluster Number
Cluster Size
Pairwise Comparison
Experiments
Setup
Datasets
Evaluation Metrics
Settings
Baselines
Results and Analysis
Main Results
Influence of Sample Size $N$
...and 2 more sections

Figures (3)

Figure 1: Comparison of several calibration methods on Mistral-7B, where SC w/ $\mathcal{F}_{CN}$ is one of our methods based on self-consistency, which will be introduced in the section "Calibration using Self-Consistency".
Figure 2: Calibration results on GSM8K when using Mixtral-8$\times$7B-Inst with different $N$.
Figure 3: Performance and calibration results on GSM8K using different models, sorted by their performance: ① LLaMA2-7B-Chat, ② LLaMA2-13B-Chat, ③ Mistral-7B-Inst, ④ LLaMA2-70B-Chat, ⑤ Mixtral-8$\times$7B-Inst.
