Effectiveness of Zero-shot-CoT in Japanese Prompts
Shusuke Takayama, Ian Frank
TL;DR
The paper investigates whether zero-shot Chain-of-Thought (CoT) prompting improves performance on reasoning tasks in Japanese as it does in English, using GPT-3.5 and GPT-4o-mini on the JMMLU and MMLU benchmarks. It appends a CoT cue to each prompt and analyzes category-level and language-level effects by scoring the models' A-D multiple-choice answers. Overall, CoT generally reduces performance, with some English gains for GPT-3.5 and isolated Japanese math improvements, while GPT-4o-mini shows sizable declines in both languages except for a few Japanese categories. The findings suggest that CoT's utility depends strongly on the model and the language, underscoring the need for language- and model-specific prompting strategies.
Abstract
We compare the effectiveness of zero-shot Chain-of-Thought (CoT) prompting in Japanese and English using GPT-3.5 and GPT-4o-mini. Zero-shot CoT, which appends a phrase such as "Let's think step by step" to a prompt to encourage reasoning before answering, has been shown to improve LLM performance on mathematical and reasoning tasks, particularly in English. We investigate how these effects transfer to Japanese using the Japanese Massive Multitask Language Understanding Benchmark (JMMLU) and the Massive Multitask Language Understanding (MMLU) benchmark. Our results show that while zero-shot CoT prompting can lead to notable performance gains for some prompt categories in GPT-3.5, in GPT-4o-mini it is associated with significant performance declines. However, for Japanese prompts, certain categories, such as college mathematics and abstract algebra, still exhibit improvements despite the broader trend of diminishing effectiveness in more advanced models.
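
To make the technique concrete, the following is a minimal sketch of how a zero-shot CoT cue can be appended to a four-choice question and how an A-D answer might be scored. It is an illustration under stated assumptions, not the authors' evaluation code: the prompt template, the helper names (`build_prompt`, `extract_answer`, `accuracy`), and the rule of taking the last standalone letter in the response are all hypothetical, and the Japanese wording of the cue is not given in this section, so only the English phrase quoted above is used.

```python
import re

# English cue quoted in the abstract; a Japanese cue would be passed in via `cue`.
COT_CUE = "Let's think step by step."


def build_prompt(question, choices, use_cot=False, cue=COT_CUE):
    """Format a four-choice question, optionally appending a zero-shot CoT cue.

    The template is an assumption made for illustration only.
    """
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer with A, B, C, or D.")
    if use_cot:
        lines.append(cue)
    return "\n".join(lines)


def extract_answer(response):
    """Return the last A-D letter that is not part of a longer ASCII word.

    ASCII lookarounds are used instead of \\b so that a letter embedded in
    Japanese text (for example "答えはBです") is still found. This heuristic
    is an assumption; it can misread a capital "A" used as an English article.
    """
    matches = re.findall(r"(?<![A-Za-z])([ABCD])(?![A-Za-z])", response)
    return matches[-1] if matches else None


def accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold label."""
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

Comparing `accuracy` over responses to `build_prompt(..., use_cot=False)` and `build_prompt(..., use_cot=True)` for the same questions gives a per-category difference of the kind the abstract reports. Taking the last letter rather than the first is a design choice intended to tolerate reasoning text that mentions other options before settling on an answer.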
