Effectiveness of Zero-shot-CoT in Japanese Prompts

Shusuke Takayama, Ian Frank

TL;DR

The paper investigates whether zero-shot Chain-of-Thought (CoT) prompting improves performance on reasoning tasks in Japanese as it does in English, using GPT-3.5 and GPT-4o-mini on the JMMLU and MMLU benchmarks. It appends a CoT cue to each prompt and compares accuracy on four-option (A-D) multiple-choice questions by subject and by language. Overall, CoT tends to reduce performance: GPT-3.5 shows gains in some English subjects and isolated gains in Japanese mathematics, while GPT-4o-mini declines markedly in both languages, apart from a few Japanese categories. The findings suggest that CoT's utility depends strongly on the model and the language, underscoring the need for language- and model-specific prompting strategies.

Abstract

We compare the effectiveness of zero-shot Chain-of-Thought (CoT) prompting in Japanese and English using ChatGPT-3.5 and 4o-mini. The technique of zero-shot CoT, which involves appending a phrase such as "Let's think step by step" to a prompt to encourage reasoning before answering, has been shown to offer LLM performance improvements in mathematical and reasoning tasks, particularly in English. We investigate how these effects transfer to Japanese using the Japanese Multi-task Language Understanding Benchmark (JMMLU) and the Multi-task Language Understanding Benchmark (MMLU). Our results show that while zero-shot CoT prompting can lead to notable performance gains for some prompt categories in GPT-3.5, its impact in GPT-4o-mini is associated with significant performance declines. However, for Japanese prompts there remain certain categories, such as college mathematics and abstract algebra, that still exhibit improvements, despite the broader trend of diminishing effectiveness in more advanced models.
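The zero-shot CoT setup described above can be illustrated with a minimal sketch. It is not the authors' code: the prompt layout, the function names `build_prompt` and `extract_choice`, and the Japanese cue string are all assumptions made for illustration. Only the English cue "Let's think step by step" comes from the abstract.

```python
import re

# The English cue is quoted in the abstract. The Japanese cue is an assumed
# translation for illustration, not the wording used in the paper.
COT_CUES = {
    "en": "Let's think step by step.",
    "ja": "ステップバイステップで考えましょう。",
}


def build_prompt(question, choices, lang, use_cot):
    """Format a four-option question and optionally append the CoT cue."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    if use_cot:
        lines.append(COT_CUES[lang])
    return "\n".join(lines)


def extract_choice(response):
    """Return the last standalone A-D letter in a model response, or None.

    Taking the last match is a heuristic: with CoT, the reasoning usually
    comes first and the final answer comes last.
    """
    matches = re.findall(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])", response)
    return matches[-1] if matches else None
```

With a helper like this, the same question can be sent with `use_cot=False` and `use_cot=True`, and the two extracted letters compared against the gold answer. A real evaluation would likely need a stricter answer format than this heuristic assumes.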

Paper Structure

This paper contains 5 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Example of question formatting in MMLU and JMMLU, allowing evaluation under comparable conditions.
  • Figure 2: Performance across subjects for GPT-3.5 in English, comparing results with and without zero-shot CoT. While CoT improves performance in some reasoning-based subjects, the overall impact is mixed.
  • Figure 3: Performance across subjects for GPT-4o-mini in English. Unlike GPT-3.5, zero-shot CoT leads to a consistent decline, with no subjects showing improvement.
  • Figure 4: Performance across subjects for GPT-3.5 in Japanese. The impact of zero-shot CoT varies, with gains in mathematics but overall weaker improvements compared to English.
  • Figure 5: Performance across subjects for GPT-4o-mini in Japanese. While the overall trend remains negative, the decline is less severe compared to English, with a few subject areas benefiting from CoT.
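The per-subject comparisons shown in Figures 2 to 5 amount to an accuracy difference between runs with and without the CoT cue. The sketch below shows one way to compute it. The record layout `(subject, predicted, gold)` and the function names are hypothetical, and the example data in the usage note are made up, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Compute accuracy per subject from (subject, predicted, gold) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, predicted, gold in records:
        totals[subject] += 1
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / totals[subject] for subject in totals}


def cot_delta(baseline, with_cot):
    """Accuracy with CoT minus accuracy without, for subjects in both runs.

    A positive value means the CoT cue helped on that subject.
    """
    base = accuracy_by_subject(baseline)
    cot = accuracy_by_subject(with_cot)
    return {subject: cot[subject] - base[subject] for subject in base if subject in cot}
```

For example, with toy records `baseline = [("math", "A", "A"), ("math", "B", "C")]` and `with_cot = [("math", "A", "A"), ("math", "C", "C")]`, `cot_delta(baseline, with_cot)` reports an improvement for `"math"`, while a subject whose accuracy drops under CoT gets a negative value.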