
CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation

Matthew DeLorenzo, Vasudev Gohil, Jeyavijayan Rajendran

TL;DR

This work addresses the gap in evaluating creativity within LLM-driven hardware code generation by introducing CreativEval, a framework that quantifies four subcomponents of creativity (fluency $F$, flexibility $X$, originality $O$, and elaboration $E$) through targeted prompts and post-processing using the GNN4IP similarity metric. It assesses multiple prominent LLMs across HDLBits-Verilog prompts and elaboration tasks, producing an equally weighted composite creativity score $C = 0.25F + 0.25X + 0.25O + 0.25E$. The experimental results indicate that GPT-3.5 generally exhibits the highest creativity among the models tested, that CodeLlama variants trail in several subcategories, and that larger models do not necessarily deliver higher creativity. By providing an open-source framework and datasets, this work enables broader benchmarking of creativity in LLM-based hardware design and highlights the importance of moving beyond functional correctness to capture innovative design capabilities.
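The composite score is an equally weighted sum of the four subcomponent scores, so the arithmetic can be illustrated with a short sketch. This is not the authors' implementation: the `CreativityScores` container, the `composite_creativity` function, and the assumption that each subscore is already normalized to the range [0, 1] are hypothetical and used here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreativityScores:
    """Hypothetical container for the four subcomponent scores.

    Assumption: each score has already been normalized to [0, 1].
    """
    fluency: float      # F
    flexibility: float  # X
    originality: float  # O
    elaboration: float  # E


def composite_creativity(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Return C = 0.25*F + 0.25*X + 0.25*O + 0.25*E by default."""
    components = (
        scores.fluency,
        scores.flexibility,
        scores.originality,
        scores.elaboration,
    )
    # Weighted sum of the four subcomponents, in the order F, X, O, E.
    return sum(w * c for w, c in zip(weights, components))


if __name__ == "__main__":
    # Made-up subscores, purely to show the calculation.
    example = CreativityScores(
        fluency=0.4, flexibility=0.2, originality=0.6, elaboration=0.8
    )
    print(f"C = {composite_creativity(example):.3f}")
```

Because the weights are equal, the composite is simply the mean of the four subscores; the `weights` parameter is included only to make that choice explicit.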

Abstract

Large Language Models (LLMs) have proved effective and efficient in generating code, leading to their utilization within the hardware design process. Prior works evaluating LLMs' abilities for register transfer level code generation focus solely on functional correctness. However, the creativity associated with these LLMs, or the ability to generate novel and unique solutions, is a metric not as well understood, in part due to the challenge of quantifying this quality. To address this research gap, we present CreativEval, a framework for evaluating the creativity of LLMs within the context of generating hardware designs. We quantify four creative sub-components (fluency, flexibility, originality, and elaboration) through various prompting and post-processing techniques. We then evaluate multiple popular LLMs (including GPT models, CodeLlama, and VeriGen) on this creativity metric, with results indicating GPT-3.5 as the most creative model in generating hardware designs.

Paper Structure

This paper contains 14 sections, 6 equations, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Experimental Framework - calculating creativity of LLMs in Verilog code generation.