Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code

Jipeng Zhang, Jianshu Zhang, Yuanzhe Li, Renjie Pi, Rui Pan, Runtao Liu, Ziqiang Zheng, Tong Zhang

TL;DR

Bridge-Coder is a novel approach that leverages LLMs' intrinsic capabilities to improve performance on low-resource programming languages (LRPLs). It applies Bridged Alignment, which progressively improves the alignment between natural language (NL) instructions and LRPLs.

Abstract

Large Language Models (LLMs) demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D. This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities. While generating additional training data for LRPLs is promising, it faces two key challenges: manual annotation is labor-intensive and costly, and LLM-generated LRPL code is often of subpar quality. The underlying cause of this issue is the gap between natural language and programming language (the NL-PL Gap), which is especially pronounced in LRPLs due to limited aligned data. In this work, we introduce a novel approach called Bridge-Coder, which leverages LLMs' intrinsic capabilities to enhance performance on LRPLs. Our method consists of two key stages. The first is Bridge Generation, where we create a high-quality dataset by utilizing LLMs' general knowledge understanding, proficiency in HRPLs, and in-context learning abilities. The second is Bridged Alignment, which progressively improves the alignment between NL instructions and LRPLs. Experimental results across multiple LRPLs show that Bridge-Coder significantly enhances model performance, demonstrating the effectiveness and generalization of our approach. Furthermore, we offer a detailed analysis of the key components of our method, providing valuable insights for future work aimed at addressing the challenges associated with LRPLs.

Paper Structure

This paper contains 40 sections, 2 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: An illustration of how a code-bridge helps solve tasks in low-resource programming languages (LRPLs). Directly generating responses leads to suboptimal results, as models struggle to follow instructions accurately in LRPLs due to limited training data. In contrast, a code-bridge can improve performance: the model first generates code and comments in a high-resource programming language (HRPL), then uses this output as a bridge to guide it toward more accurate and coherent responses in LRPLs.
  • Figure 2: An illustration of Bridge-Coder. In Bridge-Assisted Generation, the LLM first identifies tasks suitable for the target low-resource programming language (LRPL). Then, it generates a code-bridge in a high-resource programming language (HRPL), combining both code and comments to explain the solution. This code-bridge is then used to help bridge the NL-PL gap in LRPLs. In Bridged Alignment, the model is first guided by the code-bridge to help close the NL-PL gap, and later progresses to generating responses directly from natural language instructions without the code-bridge.
  • Figure 3: Ablation of Task Screening.
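
The two-phase Bridged Alignment described in the Figure 2 caption can be sketched as a data-preparation step in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `BridgedSample` record, the prompt section headers, and the `build_prompt` and `build_curriculum` helpers are hypothetical names introduced for illustration, and the example task is made up. It only shows the ordering idea: bridge-guided examples come first, followed by examples that map the instruction directly to the LRPL answer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BridgedSample:
    """One hypothetical training record; field names are illustrative."""
    instruction: str    # natural-language task description
    code_bridge: str    # HRPL solution (code plus explanatory comments)
    lrpl_response: str  # target answer in the low-resource language


def build_prompt(sample: BridgedSample, use_bridge: bool) -> str:
    """Format a prompt, optionally including the HRPL code-bridge."""
    parts = [f"### Instruction:\n{sample.instruction}"]
    if use_bridge:
        # Assumed prompt layout: the bridge sits between instruction and response.
        parts.append(f"### Code-bridge:\n{sample.code_bridge}")
    parts.append("### Response:")
    return "\n\n".join(parts)


def build_curriculum(samples: List[BridgedSample]) -> List[Dict[str, str]]:
    """Order examples so bridge-guided ones precede direct NL -> LRPL ones."""
    curriculum = []
    for stage, use_bridge in (("bridged", True), ("direct", False)):
        for s in samples:
            curriculum.append({
                "stage": stage,
                "prompt": build_prompt(s, use_bridge),
                "completion": s.lrpl_response,
            })
    return curriculum


# Made-up example task: a Python code-bridge guiding a Racket answer.
sample = BridgedSample(
    instruction="Write a function that adds two numbers.",
    code_bridge="# Add the two arguments and return the sum.\n"
                "def add(a, b):\n    return a + b",
    lrpl_response="(define (add a b) (+ a b))",
)
curriculum = build_curriculum([sample])
```

Each record in `curriculum` could then be passed to whatever fine-tuning pipeline is in use; how the two phases are scheduled during training (for example, separate epochs versus a single ordered pass) is left open here.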