CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation

Matthew DeLorenzo; Vasudev Gohil; Jeyavijayan Rajendran

TL;DR

This work addresses the gap in evaluating creativity within LLM-driven hardware code generation by introducing CreativEval, a framework that quantifies four subcomponents of creativity (fluency, flexibility, originality, and elaboration) through targeted prompts and post-processing using the GNN4IP similarity metric. It systematically assesses multiple prominent LLMs across HDLBits-Verilog prompts and elaboration tasks, producing a composite creativity score $C = 0.25F + 0.25X + 0.25O + 0.25E$. The experimental results indicate that GPT-3.5 generally exhibits the highest creativity among the models tested, while CodeLlama variants trail in several subcategories, and larger models do not necessarily deliver higher creativity. By providing an open-source framework and datasets, this work enables broader benchmarking of creativity in hardware design with LLMs and highlights the importance of moving beyond functional correctness to capture innovative design capabilities in practice.
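As a rough illustration of the composite score above, the following minimal sketch computes the equal-weight sum in Python. It assumes that $F$, $X$, $O$, and $E$ stand for fluency, flexibility, originality, and elaboration in that order, and that each subscore is already normalized to $[0, 1]$; neither assumption is confirmed by this summary. The class and function names are hypothetical, and the input values are made up for illustration rather than taken from the paper's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreativityScores:
    """Four subcomponent scores; each is ASSUMED to lie in [0, 1]."""
    fluency: float       # F (assumed symbol mapping)
    flexibility: float   # X (assumed symbol mapping)
    originality: float   # O (assumed symbol mapping)
    elaboration: float   # E (assumed symbol mapping)


def composite_creativity(s: CreativityScores) -> float:
    """Equal-weight composite C = 0.25F + 0.25X + 0.25O + 0.25E."""
    parts = (s.fluency, s.flexibility, s.originality, s.elaboration)
    for p in parts:
        # Guard for the assumed normalization; not stated in the source.
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score {p} outside assumed [0, 1] range")
    return 0.25 * sum(parts)


# Hypothetical scores, for illustration only (not results from the paper).
example = CreativityScores(0.5, 0.25, 0.75, 0.5)
print(composite_creativity(example))
```

Because all four weights are equal, the composite is simply the arithmetic mean of the subscores, so a model that is weak in one subcategory can still rank well overall if it is strong in the others.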

Abstract

Large Language Models (LLMs) have proved effective and efficient in generating code, leading to their utilization within the hardware design process. Prior works evaluating LLMs' abilities for register transfer level code generation focus solely on functional correctness. However, the creativity associated with these LLMs, or the ability to generate novel and unique solutions, is less well understood, in part due to the challenge of quantifying this quality. To address this research gap, we present CreativEval, a framework for evaluating the creativity of LLMs within the context of generating hardware designs. We quantify four creative sub-components (fluency, flexibility, originality, and elaboration) through various prompting and post-processing techniques. We then evaluate multiple popular LLMs (including GPT models, CodeLlama, and VeriGen) on this creativity metric, with results indicating GPT-3.5 as the most creative model in generating hardware designs.

Paper Structure (14 sections, 6 equations, 1 figure, 1 table)

Introduction
Background and Related Work
    LLMs for Code Generation and Hardware Design
    Evaluating Creativity
CreativEval Framework
    Fluency
    Flexibility
    Originality
    Elaboration
    Creativity: Putting It All Together
Experimental Evaluation
    Experimental Setup
    Results
Conclusion

Figures (1)

Figure 1: Experimental Framework - calculating creativity of LLMs in Verilog code generation.
