CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, Anoop Deoras

TL;DR

CodeAssistBench (CAB) addresses the lack of realistic programming-assistance benchmarks by automatically constructing multi-turn, environment-grounded tasks from real GitHub issues and evaluating LLMs in fully simulated maintainer–user dialogues. The approach combines automated Dockerized build environments, satisfaction-condition extraction, and a judge-driven scoring protocol to measure not only correctness but also conversation quality and practicality. Across $3{,}286$ issues from $214$ repositories in seven languages, results reveal a persistent gap between traditional Q&A performance (often above $70\%$ accuracy) and real-world project-specific support (frequently below $20\%$ accuracy), especially on post-cutoff data. CAB thus provides a scalable, reproducible benchmark for advancing multi-turn programming assistants and guiding improvements in environment-aware, user-centric code assistance.

Abstract

Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow questions and remain limited to single-turn interactions, manually curated data, and isolated snippets rather than full project environments. We introduce CodeAssistBench (CAB), the first benchmark for evaluating multi-turn, project-grounded programming assistance at scale. CAB automatically constructs datasets from GitHub issues tagged as questions, using an LLM-driven pipeline that filters noise, extracts runnable contexts, builds executable containers, and verifies environment correctness. This enables continuous, automated expansion across diverse repositories without manual intervention. Using CAB, we create a testbed of 3,286 real-world issues across 214 repositories, spanning seven languages. Evaluating state-of-the-art models reveals a substantial gap: while models achieve 70-83% accuracy on Stack Overflow-style questions, they solve only 16.49% of CAB issues from post-training-cutoff repositories. On a manually validated subset of 149 issues, top models such as Claude Sonnet 4.5 reach only 12.08% correctness. These results highlight a fundamental challenge: current LLMs struggle to provide assistance in realistic, project-specific contexts despite strong performance on traditional Q&A benchmarks. CAB provides a scalable, reproducible framework for advancing research in multi-turn, codebase-grounded programming agents. The benchmark and pipeline are fully automated and publicly available at https://github.com/amazon-science/CodeAssistBench/.

Paper Structure

This paper contains 75 sections, 4 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: A port-mapping clarification from a real GitHub issue.
  • Figure 2: CAB’s automated data generation pipeline. It collects relevant GitHub repositories, filters issue conversations, and produces structured assistance scenarios with build environments, satisfaction conditions, and reference user responses.
  • Figure 3: CAB evaluation pipeline. A simulated user chats with the Maintainer Agent, which can run code in an optional build sandbox; the interaction continues until the user is satisfied or reaches the maximum turn limit. Once the dialogue ends, an LLM-judge grades the exchange against satisfaction conditions extracted from the original GitHub issue. This pipeline enables realistic assessment of programming assistance in context-rich, project-specific scenarios.
  • Figure 4: Side-by-side comparison of model responses to a Docker port-remapping issue: Haiku 3.5's incomplete solution (middle) fails to address key requirements, while ChatGPT 4.1 Mini's successful response (right) satisfies all three user conditions (highlighted in colored boxes).
  • Figure 5: Distribution of GitHub issue conversation lengths by programming language. Each turn corresponds to a maintainer response; for example, a 1-turn conversation consists of a user question and a single maintainer reply, while longer conversations reflect additional back-and-forth exchanges.
  • ...and 2 more figures
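
The evaluation loop described in the Figure 3 caption (a simulated user talks to the Maintainer Agent until satisfied or until the turn limit is reached, after which a judge checks the dialogue against the satisfaction conditions) can be illustrated with a minimal Python sketch. This is not the CAB implementation: every name here (`run_dialogue`, `judge`, `DialogueResult`, and the `toy_*` stand-ins) is hypothetical, the LLM calls are replaced by plain functions, and the optional build sandbox is omitted. The toy judge uses simple substring matching, which is only a placeholder for an LLM judge.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names in this sketch are hypothetical; none come from the CAB codebase.
Message = Tuple[str, str]  # (role, text), role is "user" or "maintainer"


@dataclass
class DialogueResult:
    transcript: List[Message] = field(default_factory=list)
    turns_used: int = 0
    user_satisfied: bool = False


def run_dialogue(
    question: str,
    maintainer: Callable[[List[Message]], str],
    user: Callable[[List[Message]], Tuple[bool, str]],
    max_turns: int,
) -> DialogueResult:
    """Alternate maintainer replies and user reactions until the user is
    satisfied or max_turns maintainer replies have been produced."""
    result = DialogueResult(transcript=[("user", question)])
    for turn in range(1, max_turns + 1):
        reply = maintainer(result.transcript)
        result.transcript.append(("maintainer", reply))
        result.turns_used = turn
        satisfied, follow_up = user(result.transcript)
        if satisfied:
            result.user_satisfied = True
            break
        result.transcript.append(("user", follow_up))
    return result


def judge(
    transcript: List[Message],
    conditions: List[str],
    condition_met: Callable[[List[Message], str], bool],
) -> Dict[str, bool]:
    """Grade the finished dialogue against each satisfaction condition."""
    return {c: condition_met(transcript, c) for c in conditions}


# Toy stand-ins for the LLM-backed components (assumptions for illustration).
def toy_maintainer(transcript: List[Message]) -> str:
    replies_so_far = sum(1 for role, _ in transcript if role == "maintainer")
    if replies_so_far == 0:
        return "Try publishing the port with -p 8080:80"
    return "Also set PORT=80 inside the container"


def toy_user(transcript: List[Message]) -> Tuple[bool, str]:
    last_reply = transcript[-1][1]
    if "PORT=80" in last_reply:
        return True, ""
    return False, "That did not work, what else should I change?"


def toy_condition_met(transcript: List[Message], condition: str) -> bool:
    # Placeholder check: a real judge would be an LLM, not a substring match.
    return any(role == "maintainer" and condition in text
               for role, text in transcript)
```

Calling `run_dialogue("How do I remap the port?", toy_maintainer, toy_user, max_turns=5)` and passing the resulting transcript to `judge` yields one boolean per condition. Keeping the judge separate from the dialogue loop mirrors the caption's ordering, where grading happens only once the conversation has ended.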