Assessing the Impact of Prompting Methods on ChatGPT's Mathematical Capabilities

Yuhao Chen, Chloe Wong, Hanwen Yang, Juan Aguenza, Sai Bhujangari, Benthan Vu, Xun Lei, Amisha Prasad, Manny Fluss, Eric Phuong, Minghao Liu, Raja Kumar, Vanshika Vats, James Davis

TL;DR

Prompting strategies that work on linguistic tasks do not necessarily generalize to new domains: in this study, none of the tested methods consistently improved ChatGPT-3.5's mathematical performance.

Abstract

This study critically evaluates the efficacy of prompting methods in enhancing the mathematical reasoning capability of large language models (LLMs). The investigation uses three prescriptive prompting methods (simple, persona, and conversational prompting) known for their effectiveness in improving LLM performance on linguistic tasks. We conduct this analysis on OpenAI's LLM chatbot, ChatGPT-3.5, using extensive problem sets from the MATH, GSM8K, and MMLU datasets, which encompass a broad spectrum of mathematical challenges. A grading script adapted to each dataset determines how effectively these prompting interventions enhance the model's mathematical analysis power. Contrary to expectations, our empirical analysis reveals that none of the investigated methods consistently improves over ChatGPT-3.5's baseline performance, and some cause significant degradation. Our findings suggest that prompting strategies do not necessarily generalize to new domains; in this study, they failed to enhance mathematical performance.
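To make the idea of a dataset-adapted grading script concrete, the following is a minimal sketch for GSM8K-style problems only. It is not the authors' script: the function names are hypothetical, and the rule of taking the last number in a free-form response is an assumed heuristic. The only dataset fact it relies on is that GSM8K reference solutions end with a line of the form "#### <answer>".

```python
import re
from typing import Optional


def extract_gsm8k_truth(solution: str) -> str:
    """Return the final answer from a GSM8K reference solution ('... #### <answer>')."""
    return solution.split("####")[-1].strip().replace(",", "")


def extract_last_number(response: str) -> Optional[str]:
    """Assumed heuristic: treat the last number in the model's response as its answer."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    if not matches:
        return None
    # Drop thousands separators so "1,250" compares equal to "1250".
    return matches[-1].replace(",", "")


def is_correct(response: str, solution: str) -> bool:
    """Compare the extracted prediction with the ground truth numerically."""
    predicted = extract_last_number(response)
    if predicted is None:
        return False
    try:
        return abs(float(predicted) - float(extract_gsm8k_truth(solution))) < 1e-6
    except ValueError:
        # Non-numeric ground truth or malformed prediction: count as incorrect.
        return False
```

A heuristic like this mislabels responses that mention further numbers after the final answer, which is one reason each dataset needs its own grading rules. MATH and MMLU use different answer formats and would need separate extraction logic.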

Paper Structure

This paper contains 11 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Illustration of prompting workflow: The figure presents one example each of the (a) simple, (b) persona, and (c) conversational prompting methods. We initiate the conversation with ChatGPT using the defined prompts and evaluate only the response to the mathematical question asked afterward.
  • Figure 2: Examples of question-answer pairs in each dataset: The figure presents one example each from the (a) MATH, (b) MMLU, and (c) GSM8K datasets, along with their dataset-provided ground truth answers (GT). Below each example is ChatGPT's response for comparison. Notice that ChatGPT's responses are neither straightforward nor completely aligned with the ground truth format. We thus need specially designed grading methods to evaluate accuracy.
  • Figure 3: Visualization of change in accuracy: Baseline accuracy for each dataset is normalized to 0 (zero) on the y-axis. Bars denote $\Delta$ Accuracy, the improvement or deterioration in performance when using prompts, relative to the baseline. Notice that the only significant improvement from prompting was "Math Conversation" on the MMLU dataset. However, this prompting method showed a significant degradation of performance on the GSM8K dataset. We conclude that none of the evaluated prompting methods provides generalizable gains in mathematical performance.
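
The $\Delta$ Accuracy plotted in Figure 3 can be read as each prompting method's accuracy minus the baseline accuracy on the same dataset. The sketch below shows that computation under stated assumptions: the function names are hypothetical, the per-question grades are toy placeholders rather than results from the paper, and values are returned as fractions rather than percentage points.

```python
from typing import Dict, List


def accuracy(graded: List[bool]) -> float:
    """Fraction of questions graded as correct."""
    if not graded:
        raise ValueError("cannot compute accuracy of an empty result list")
    return sum(graded) / len(graded)


def delta_accuracy(
    baseline_graded: List[bool],
    prompted_graded: Dict[str, List[bool]],
) -> Dict[str, float]:
    """Accuracy of each prompting method minus the baseline (baseline maps to 0)."""
    base = accuracy(baseline_graded)
    return {method: accuracy(g) - base for method, g in prompted_graded.items()}


if __name__ == "__main__":
    # Toy placeholder grades, NOT data from the paper.
    baseline = [True, True, False, False]
    prompted = {
        "method_a": [True, True, True, False],
        "method_b": [True, False, False, False],
    }
    for method, delta in delta_accuracy(baseline, prompted).items():
        print(f"{method}: {delta:+.2f}")
```

A positive value corresponds to a bar above the zero line in Figure 3, and a negative value to a bar below it. Comparing the sign of the same method's delta across datasets is what exposes a gain that does not generalize.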