Deep-Bench: Deep Learning Benchmark Dataset for Code Generation

Alireza Daghighfarsoodeh; Chung-Yu Wang; Hamed Taherkhani; Melika Sepidband; Mohammad Abdollahi; Hadi Hemmati; Hung Viet Pham

TL;DR

DeepBench addresses the problem of evaluating large language models (LLMs) on deep-learning (DL) code generation with a function-level benchmark that spans the full DL pipeline and multiple input data types. It is built from GitHub-derived data through a two-stage process of raw data extraction and labeling, yielding 520 entries that are categorized by pipeline phase, ML task, and input data type and accompanied by ground-truth code and unit tests. An empirical study shows that current state-of-the-art LLMs struggle with DL-specific code: GPT-4o achieves only 31% pass@1 on DeepBench, versus 60% on DS-1000, which underscores the benchmark's greater difficulty and its potential to reveal strengths and gaps in prompting and model capabilities. The authors also develop a taxonomy of DL-specific bugs in LLM-generated code, offering actionable insights for improving DL code generation, and make the benchmark publicly available for ongoing research.

Abstract

Deep learning (DL) has revolutionized areas such as computer vision and natural language processing. However, developing DL systems is challenging due to the complexity of DL workflows. Large Language Models (LLMs), such as GPT, Claude, Llama, and Mistral, have emerged as promising tools to assist in DL code generation, offering potential solutions to these challenges. Despite this, existing benchmarks such as DS-1000 are limited: they focus primarily on small DL code snippets related to pre/post-processing tasks and lack comprehensive coverage of the full DL pipeline, including different DL phases and input data types. To address this, we introduce DeepBench, a novel benchmark dataset designed for function-level DL code generation. DeepBench categorizes DL problems based on three key aspects: phases, such as pre-processing, model construction, and training; tasks, including classification, regression, and recommendation; and input data types, such as tabular, image, and text. GPT-4o, the state-of-the-art LLM, achieved 31% accuracy on DeepBench, significantly lower than its 60% on DS-1000. We observed similar difficulty for other LLMs (e.g., 28% vs. 54% for Claude, 21% vs. 41% for Llama, and 15% vs. 20% for Mistral). This result underscores DeepBench's greater complexity. We also construct a taxonomy of issues and bugs found in LLM-generated DL code, which highlights the distinct challenges that LLMs face when generating DL code compared to general code. Furthermore, our analysis reveals substantial performance variations across categories, with differences of up to 7% among phases and 37% among tasks. These disparities suggest that DeepBench offers valuable insights into LLMs' performance and areas for potential improvement in the DL domain.
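
The headline numbers above are pass@1 scores, i.e. the fraction of benchmark problems for which a generated function passes all of its unit tests. The abstract does not say how the scores were computed, so the following is only a minimal sketch that assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), averaged over problems. The function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` result format are hypothetical and are not taken from the DeepBench codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass every unit test
    k: evaluation budget (k = 1 gives pass@1, which reduces to c / n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; `results` holds one (n, c) pair per problem.

    The (n, c) input format is an assumption made for this illustration.
    """
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem, `benchmark_pass_at_k` is simply the share of problems solved, so a score of 31% would correspond to roughly one problem in three passing its tests on the first attempt.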
