
Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts

Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Libo Qin, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che

TL;DR

This paper proposes a task and model agnostic approach called MultiPoT, which achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models and significantly outperforms Python Self-Consistency.

Abstract

Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models. The effectiveness of each language varies depending on the specific scenarios. Inspired by this, we propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
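
A minimal sketch of how results from several languages might be combined, based on the voting step named in the Figure 2 caption. The function name `vote`, the use of `None` for a failed execution, and the first-seen tie-break are illustrative assumptions, not the authors' implementation; the sample answers are made up.

```python
from collections import Counter
from typing import Dict, Optional


def vote(results: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the most common answer among per-language execution results.

    `results` maps a language name to the answer its program produced,
    or None if that program failed to run (an assumption of this sketch).
    Ties go to the answer seen first, which is also an assumption.
    """
    valid = [answer for answer in results.values() if answer is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical per-language answers to one date question.
sample = {"Python": "01/01/2008", "R": "12/31/2007", "JavaScript": "12/31/2007"}
print(vote(sample))  # 12/31/2007
```

Here a single language's mistake is outvoted by the other two, which is the kind of diversity benefit the abstract attributes to using several languages.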

Paper Structure (26 sections, 20 figures, 18 tables)

Figures (20)

  • Figure 1: Comparison of PoT with different PLs. Python's 'timedelta' does not support year-based arithmetic, so the program subtracts 365 days and gets the wrong answer because 2008 is a leap year with 366 days. R and JavaScript compute the year directly and get the correct answer.
  • Figure 2: An overview of MultiPoT and Self-Consistency. MultiPoT first constructs prompts for each PL, ensuring a consistent reasoning process while also considering the distinct coding styles. It then integrates these PLs: generating multilingual PoTs based on the prompts, executing them to gather results, and finally voting for the answer. In contrast to Self-Consistency’s single-language focus, MultiPoT leverages multiple PLs.
  • Figure 3: The greedy decoding performance of three models across five tasks in five different PLs. AVG denotes the average performance of a PL across all tasks. Each language's performance is expressed as a ratio to the highest-performing language for that specific task. The center of the circle represents 50%. Detailed numerical data are provided in Table \ref{tab:re1} in Appendix \ref{sec:appendix_results}.
  • Figure 4: The reasoning ability, code generation ability, and percentage in pre-training data for different languages. Generation lacks data for R. The horizontal coordinates of each model are ranked according to the rise in reasoning performance (excluding R).
  • Figure 5: The impact of the number of integrated PLs. We test different orders of adding languages.
  • ...and 15 more figures
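
The leap-year pitfall described in the Figure 1 caption can be reproduced with a short sketch. The date 2008-12-31 and both helper names are assumptions chosen for illustration; the caption does not give the exact question used in the figure.

```python
from datetime import date, timedelta


def one_year_ago_by_days(d: date) -> date:
    """Approximate "one year ago" as 365 days, since timedelta has no year unit."""
    return d - timedelta(days=365)


def one_year_ago_by_calendar(d: date) -> date:
    """Step the year field back by one; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:
        # Only reached for Feb 29 when the previous year is not a leap year.
        return d.replace(year=d.year - 1, day=28)


today = date(2008, 12, 31)  # hypothetical input inside a leap year
print(one_year_ago_by_days(today))      # 2008-01-01
print(one_year_ago_by_calendar(today))  # 2007-12-31
```

Because 2008 has 366 days, subtracting 365 days lands one day short of the calendar answer, while adjusting the year field directly does not.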