CRQBench: A Benchmark of Code Reasoning Questions
Elizabeth Dinella, Satish Chandra, Petros Maniatis
TL;DR
CRQBench is a benchmark for evaluating code reasoning in LLMs, built from 100 C++ code reasoning questions and answers mined from contextualized code review comments. The authors introduce a cooperative LLM-human curation pipeline that converts review comments into concise code reasoning questions (CRQs), then evaluate GPT-4 on the result: it produces correct responses grounded in the given context for 65 of the 100 questions. Key contributions include a Code Reasoning Classifier, a rephrasing pipeline (Edit Generator, Expression Extractor, Validator), and an analysis of the manual curation effort saved, alongside a candid assessment of limitations and of the context gaps that affect model performance. The work argues that realistic, context-rich benchmarks are needed to measure semantic reasoning about code, and it offers a practical methodology for scalable curation and evaluation.
Abstract
Large Language Models have demonstrated exceptional proficiency on coding tasks, but it is challenging to precisely evaluate their code reasoning ability. Existing benchmarks are insufficient as they are unrealistic and conflate semantic reasoning ability with performance on software engineering tasks. We introduce CRQBench, a benchmark of 100 C++ code reasoning questions and answers derived from contextualized code review comments. To curate CRQBench, we use an LLM assistant alongside human inspection, reducing manual effort. We conduct an evaluation of GPT-4 on CRQBench and find that it produces correct responses grounded in the given context for 65 of the 100 questions.
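To make the reported score concrete, the following is a minimal sketch of how a benchmark entry and the headline accuracy could be represented. The class name, field names, and the sample question are hypothetical illustrations and are not taken from CRQBench; the judgment list only mirrors the reported count of 65 correct out of 100 and does not reflect per-question results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeReasoningQuestion:
    """Hypothetical record layout; the benchmark's real schema may differ."""
    context: str           # code the question refers to
    question: str          # concise code reasoning question
    reference_answer: str  # answer derived from the review discussion


def accuracy(judgments):
    """Return the fraction of responses judged correct."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)


# Invented example entry, for illustration only.
example = CodeReasoningQuestion(
    context="int idx = find(xs, key);\nreturn xs[idx];",
    question="Can idx be negative when xs[idx] is evaluated?",
    reference_answer="Yes, if find returns -1 when key is absent.",
)

# Placeholder judgments matching the reported totals (65 correct, 35 not).
judgments = [True] * 65 + [False] * 35
score = accuracy(judgments)
print(f"{score:.0%} of {len(judgments)} questions answered correctly")
```

Scoring each response as a single boolean is a simplification; it assumes a human or automated judge has already decided whether the answer is both correct and grounded in the given context.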
