Pragmatic Reasoning improves LLM Code Generation

Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg

TL;DR

This paper tackles ambiguity in natural-language-to-code generation by introducing CodeRSA, a code candidate reranking method grounded in the Rational Speech Act (RSA) framework. CodeRSA models a pragmatic listener and a pragmatic speaker to rank candidates, using instruction clustering to handle paraphrase equivalence and prior-informed temperatures to integrate plausibility. Across two instruction-tuned LLMs (Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct) and two benchmarks (HumanEval and MBPP), CodeRSA consistently outperforms Coder and CoderReviewer, with robust performance and a best MBPP accuracy of 59.53% in one setting. The results demonstrate the value of applying pragmatic, RSA-based reasoning to improve code generation quality and alignment with user intent.

Abstract

Large Language Models (LLMs) have demonstrated impressive potential in translating natural language (NL) instructions into program code. However, user instructions often contain inherent ambiguities, making it challenging for LLMs to generate code that accurately reflects the user's true intent. To address this challenge, researchers have proposed approaches that produce multiple candidate programs and then rerank them to identify the best solution. In this paper, we propose CodeRSA, a novel code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework, designed to guide LLMs toward more comprehensive pragmatic reasoning about user intent. We evaluate CodeRSA using Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct on two widely used code generation benchmarks, HumanEval and MBPP. Our experimental results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. These findings underscore the effectiveness of integrating pragmatic reasoning into code candidate reranking, offering a promising direction for enhancing code generation quality in LLMs.
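
To make the reranking idea concrete, the following is a minimal sketch of generic RSA-style reranking, not the authors' implementation. It assumes a hypothetical score table in which `log_scores[k][j]` is a model log-score of candidate `j` given instruction `k`, with row 0 being the user's original instruction and the remaining rows being alternative instructions. The function names, the `rationality` parameter, the uniform prior over candidates, and the toy numbers are all illustrative assumptions; the instruction clustering and prior-informed temperatures mentioned in the TL;DR are omitted.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of log-scores."""
    peak = max(logits)
    return normalize([math.exp(x - peak) for x in logits])


def rsa_rerank(log_scores, rationality=1.0, target=0):
    """Return a pragmatic-listener distribution over candidates.

    log_scores[k][j]: assumed log-score of candidate j under instruction k.
    target: index of the user's original instruction.
    """
    n_instr = len(log_scores)
    n_cand = len(log_scores[0])

    # Literal listener: for each instruction, a distribution over candidates.
    literal = [softmax(row) for row in log_scores]

    # Pragmatic speaker: for each candidate, how likely a speaker who meant
    # that candidate would be to choose each instruction.
    speaker = [
        normalize([literal[k][j] ** rationality for k in range(n_instr)])
        for j in range(n_cand)
    ]

    # Pragmatic listener: score each candidate by how well it "explains" the
    # original instruction (uniform prior over candidates assumed here).
    return normalize([speaker[j][target] for j in range(n_cand)])


# Toy, invented numbers: candidate 0 is slightly preferred under the original
# instruction (row 0) but fits the alternative instruction (row 1) far better.
toy_scores = [[-1.0, -1.2], [-0.5, -3.0]]
pragmatic = rsa_rerank(toy_scores)
best = max(range(len(pragmatic)), key=pragmatic.__getitem__)
```

In this toy input the literal scores favor candidate 0, but the pragmatic listener selects candidate 1: a speaker who wanted candidate 0 would more likely have used the alternative instruction, so the original instruction is better evidence for candidate 1.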

Paper Structure

This paper contains 21 sections, 17 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: A comparison of our approach, CodeRSA (top), with CoderReviewer (bottom).
  • Figure 2: The prompts used to calculate Coder score and generate additional instructions.
  • Figure 3: Accuracy of CodeRSA across different values of the calibration parameter $\alpha$. The shaded region indicates a stable performance band.
  • Figure 4: Details of the question and two generated examples.
  • Figure 5: Coder Score Comparison
  • ...and 5 more figures