Large Language Models are Contrastive Reasoners

Liang Yao

TL;DR

The paper introduces Contrastive Prompting (CP), a template-based prompting approach that asks a large language model to produce both a correct and a wrong answer, guiding its reasoning without task-specific labeled demonstrations. CP runs in two stages (reasoning extraction, then answer extraction) using self-augmented prompts, and can be combined with existing prompting methods (X-CP). Experiments on GPT-4, GPT-3.5-Turbo, and open LLMs show substantial improvements over zero-shot prompting, and often over zero-shot-CoT, on arithmetic, commonsense, and symbolic reasoning tasks, with notable gains on GSM8K and AQUA-RAT; combined with other prompting strategies, CP can approach or surpass state-of-the-art results. The method is simple to deploy, applies across models, and its code is publicly available. The work also examines how the prompt template and the number of generated wrong answers affect accuracy, discusses strengths and limitations, and outlines future work on smaller models and deeper integration with advanced prompting techniques.

Abstract

Prompting methods play a crucial role in enhancing the capabilities of pre-trained large language models (LLMs). We explore how contrastive prompting (CP) significantly improves the ability of large language models to perform complex reasoning. We demonstrate that LLMs are decent contrastive reasoners by simply adding "Let's give a correct and a wrong answer." before LLMs provide answers. Experiments on various large language models show that zero-shot contrastive prompting improves the performance of standard zero-shot prompting on a range of arithmetic, commonsense, and symbolic reasoning tasks without any hand-crafted few-shot examples, such as increasing the accuracy on GSM8K from 35.9% to 88.8% and AQUA-RAT from 41.3% to 62.2% with the state-of-the-art GPT-4 model. Our method not only surpasses zero-shot CoT and few-shot CoT in most arithmetic and commonsense reasoning tasks but also can seamlessly integrate with existing prompting methods, resulting in improved or comparable results when compared to state-of-the-art methods. Our code is available at https://github.com/yao8839836/cp
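The two-stage procedure described above (reasoning extraction, then answer extraction) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the trigger sentence "Let's give a correct and a wrong answer." comes from the abstract. The `Q:`/`A:` layout, the wording of `ANSWER_TRIGGER`, the function names, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for this example.

```python
from typing import Callable

# Trigger sentence quoted in the abstract.
CP_TRIGGER = "Let's give a correct and a wrong answer."

# Hypothetical answer-extraction trigger; the exact wording is an assumption.
ANSWER_TRIGGER = "Therefore, the correct answer is"


def build_reasoning_prompt(question: str) -> str:
    """Stage 1 prompt: the question followed by the contrastive trigger.

    The "Q:/A:" layout is an assumed template, not taken from the source.
    """
    return f"Q: {question}\nA: {CP_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str,
                        answer_trigger: str = ANSWER_TRIGGER) -> str:
    """Stage 2 prompt: stage 1 prompt + generated reasoning + answer trigger."""
    return f"{build_reasoning_prompt(question)} {reasoning.strip()}\n{answer_trigger}"


def zero_shot_cp(question: str, llm: Callable[[str], str],
                 answer_trigger: str = ANSWER_TRIGGER) -> str:
    """Run both stages with a caller-supplied completion function."""
    # Stage 1: elicit reasoning that contains a correct and a wrong answer.
    reasoning = llm(build_reasoning_prompt(question))
    # Stage 2: feed the reasoning back and ask only for the correct answer.
    answer = llm(build_answer_prompt(question, reasoning, answer_trigger))
    return answer.strip()
```

In use, `llm` would wrap whichever chat or completion API is available. The second call sees the model's own first-stage output, which is what makes the prompt self-augmented.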

Paper Structure (28 sections, 16 figures, 10 tables)

This paper contains 28 sections, 16 figures, 10 tables.

Introduction
Related Works
Large language models and prompting
Learning from Negative Examples
Contrastive Prompting
Two-stage prompting
1st prompt: reasoning extraction
2nd prompt: answer extraction
Integrating with other prompting methods
Experiment
Settings
Datasets
Baselines
Models
Answer filtering
...and 13 more sections

Figures (16)

Figure 1: Example inputs and outputs of GPT-4 with (a) standard Zero-shot, and (b) ours (Zero-shot-CP). In contrast to Few-shot-CoT, which requires step-by-step reasoning examples for each task, our approach does not rely on any examples. Instead, we use the same prompt "Let's give a correct and a wrong answer" for all tasks, including arithmetic, symbolic, commonsense, and other logical reasoning tasks.
Figure 2: The complete process of Zero-shot-CP involves two steps. First, we use the initial "reasoning" prompt to extract a full reasoning process from an LLM. Second, we use the subsequent "answer" prompt to extract the correct answer from the reasoning text.
Figure 3: Accuracy scores by varying the number of wrong answers. We test GPT-4 and GPT-3.5-Turbo on (a) AQUA-RAT, (b) GSM8K, (c) AddSub and (d) MultiArith. The range of the number of wrong answers is from 0 (Zero-shot) to 4.
Figure 4: Token output probabilities for different prompts, obtained by setting the logprobs (log probabilities) parameter of the OpenAI API with GPT-4. The example is from StrategyQA, and the ground truth is "Yes". A higher logprobs value means a greater probability. Zero-shot-CP makes GPT-4 more confident in the answer than either Zero-shot or the prompt "Let's give a correct answer."
Figure 5: Example outputs by Zero-shot-CP for AddSub.
...and 11 more figures
