ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions

Julian Aron Prenner, Romain Robbes

TL;DR

ThrowBench addresses a gap in LLM evaluation: it tests whether models can predict runtime exceptions, rather than only whether they can generate code. It constructs a multilingual, execution-grounded benchmark from RunBugRun, containing 2,466 buggy programs across four languages and 37 exception types, and evaluates six open-weight LLMs using a fixed set of exception options. The results show modest overall performance (best F1 around 0.38) with substantial variation by language and exception type, which leaves clear room for improvement and supports the value of runtime-focused evaluation. The work provides a contamination-resistant, publicly available benchmark to broaden the assessment of code understanding in LLMs and to guide future improvements in predicting runtime semantics.

Abstract

Modern Large Language Models (LLMs) have shown astounding capabilities in code understanding and synthesis. To assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural language instructions and therefore do not test other forms of code understanding. Moreover, there have been concerns about contamination and leakage: benchmark problems (or closely related problems) may appear in the training set, strongly biasing benchmark results. In this work we investigate whether large language models can correctly predict runtime program behavior. To this end, we introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs in four different programming languages. The majority of these programs throw an exception at runtime (due to a bug). LLMs are asked to predict whether a presented program throws an exception and, if so, which one. Evaluating six state-of-the-art code LLMs on our benchmark, we see modest performance, with F1 scores ranging from 19% to 38%. Benchmarking a wider set of code capabilities could improve the assessment of code LLMs and help identify weak points in current models. Moreover, as ground-truth answers have been determined through program execution, leakage is not a concern. We release ThrowBench as well as all of our results together with this work.

Paper Structure

This paper contains 7 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Prompt used in our evaluation. Following common conventions, we use the verb raise for Python and Ruby, and throw for C# and Java. <CODE> is replaced with the example code, <INPUT> with the input that triggers the exception.
  • Figure 2: A simple Python example in ThrowBench. Regardless of the input, variable A is of type list and does not have a method islower, so an AttributeError is raised.
  • Figure 3: A simple Java example in ThrowBench. On input -1 0, an ArithmeticException is thrown. Note: imports, main class and main method omitted for brevity.
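
To make the idea of an execution-grounded label concrete, the following is a minimal sketch of how the exception in a Figure 2 style program could be observed by running it. This is not the authors' pipeline: the helper `exception_label`, the label string `"no exception"`, and both example programs are hypothetical and were written for this illustration only. The buggy program merely mimics the bug described in the Figure 2 caption (calling `islower` on a list).

```python
import contextlib
import io


def exception_label(source: str, stdin_text: str = "") -> str:
    """Run `source` and return the name of the exception it raises.

    Hypothetical helper for illustration; returns "no exception"
    (an assumed label) when the program finishes normally.
    """
    lines = iter(stdin_text.splitlines())

    def fake_input(prompt: str = "") -> str:
        # Mimic the built-in input(): raise EOFError when input runs out.
        try:
            return next(lines)
        except StopIteration:
            raise EOFError("EOF when reading a line") from None

    # Names in the globals dict shadow built-ins, so the program
    # reads from `stdin_text` instead of the real standard input.
    env = {"__name__": "__main__", "input": fake_input}
    try:
        with contextlib.redirect_stdout(io.StringIO()):
            exec(source, env)
    except Exception as exc:
        return type(exc).__name__
    return "no exception"


# Hypothetical program in the spirit of Figure 2: A is a list,
# and lists have no `islower` method.
BUGGY = """
A = input().split()
print("a" if A.islower() else "A")
"""

# Hypothetical fix: keep A as a string.
FIXED = """
A = input()
print("a" if A.islower() else "A")
"""

buggy_label = exception_label(BUGGY, "b")
fixed_label = exception_label(FIXED, "b")
```

Because the label comes from actually running the program on a concrete input, it does not depend on any published reference answer, which is the property the abstract relies on when it argues that leakage is not a concern. Running untrusted programs with `exec` is unsafe outside a sandbox; the sketch ignores isolation and timeouts for brevity.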