NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation

Hossain Shaikh Saadi, Faria Alam, Mario Sanz-Guerrero, Minh Duc Bui, Manuel Mager, Katharina von der Wense

TL;DR

The paper tackles Bangla instruction-to-Python code generation with a two-agent pipeline: a code-generation agent proposes a solution, and a debugger agent repairs failing solutions using error traces and unit tests. It uses external and generated unit tests to broaden coverage and reports substantial gains from test-driven feedback, reaching a top Pass@1 of 95.4% on Codabench. The study also analyzes overfitting risks and the effects of external data, generated test cases, and translation. The findings underscore the value of structured, test-driven refinement for functional correctness when generating code from instructions in an underserved language.
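As a rough illustration of the metric quoted above, Pass@1 with a single submitted program per problem is usually the percentage of problems whose program passes all of its tests. The sketch below assumes that reading; the function name `pass_at_1` and the boolean input format are illustrative and not taken from the paper or the shared-task scorer.

```python
from typing import Sequence


def pass_at_1(solved: Sequence[bool]) -> float:
    """Percentage of problems whose single submitted program passed every test.

    `solved[i]` is True when problem i's program passed all of its unit tests.
    This input format is an assumption made for illustration.
    """
    if not solved:
        raise ValueError("no problems were evaluated")
    return 100.0 * sum(solved) / len(solved)


# Toy example: 3 of 4 problems solved.
print(pass_at_1([True, True, False, True]))
```

Under this reading, a score of 95.4 would mean that roughly 95 of every 100 test problems were solved on the first submitted program.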

Abstract

This paper presents JGU Mainz's winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a $Pass@1$ score of 95.4. We also make our code public.
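The generate, test, and debug loop described in the abstract might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the two agents are passed in as plain callables (`generate` and `debug` are hypothetical stand-ins for LLM calls), each unit test is a single assert statement executed in a subprocess, and the `max_rounds` limit is an assumption, since the abstract does not state how many debugging rounds are used.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Failure = Tuple[str, str]  # (failing test, error trace)


def run_tests(program: str, tests: List[str], timeout: float = 10.0) -> List[Failure]:
    """Run each assert-style test against the program and collect the failures."""
    failures: List[Failure] = []
    for test in tests:
        with tempfile.TemporaryDirectory() as tmp:
            script = Path(tmp) / "candidate.py"
            script.write_text(program + "\n\n" + test + "\n", encoding="utf-8")
            try:
                proc = subprocess.run(
                    [sys.executable, str(script)],
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                failures.append((test, "TimeoutExpired"))
                continue
            if proc.returncode != 0:
                # stderr holds the traceback, e.g. an AssertionError.
                failures.append((test, proc.stderr.strip()))
    return failures


def solve(
    instruction: str,
    tests: List[str],
    generate: Callable[[str], str],
    debug: Callable[[str, str, List[Failure]], str],
    max_rounds: int = 3,  # assumed limit; not specified in the abstract
) -> str:
    """Generate a program, then hand only failing cases to the debugger."""
    program = generate(instruction)
    for _ in range(max_rounds):
        failures = run_tests(program, tests)
        if not failures:
            break  # all tests pass, so the debugger is never called
        program = debug(instruction, program, failures)
    return program
```

In this sketch the debugger receives the instruction, the current program, and only the failing tests with their traces, which mirrors the selective debugging described above. Running untrusted generated code in a plain subprocess is a simplification; a real system would likely want stronger sandboxing.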

Paper Structure

This paper contains 19 sections, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Multi-agent Bangla$\to$Python code generation pipeline with selective debugging via unit test feedback.
  • Figure 2: Pass@1 of different models across both stages, with and without translation.