LLM-FSM: Scaling Large Language Models for Finite-State Reasoning in RTL Code Generation

Yuheng Wu, Berk Gokmen, Zhouhua Xie, Peijing Li, Caroline Trippel, Priyanka Raina, Thierry Tambe

TL;DR

LLM-FSM introduces a scalable, automated benchmark for NL specification-to-RTL generation focused on finite-state reasoning in RTL design. It builds 1,000 problems via an end-to-end pipeline that starts from an abstract FSM topology, enriches it with semantics in YAML via LLMs, synthesizes reference RTL and testbenches, generates NL specifications, and performs SAT-based equivalence checks to filter valid instances. The study shows that even strong LLMs struggle as FSM complexity grows, but that training-time scaling through supervised fine-tuning and multi-trace test-time sampling improve robustness and generalization, and that results correlate strongly with human-written RTL benchmarks. The framework’s automatic generation, formal verification, and extensible topology allow scalable, realistic assessment of FSM reasoning in LLM-driven RTL synthesis, aiding future improvements in hardware-aware language models.

Abstract

Finite-state reasoning, the ability to understand and implement state-dependent behavior, is central to hardware design. In this paper, we present LLM-FSM, a benchmark that evaluates how well large language models (LLMs) can recover finite-state machine (FSM) behavior from natural-language specifications and translate it into correct register-transfer-level (RTL) implementations. Unlike prior specification-to-RTL benchmarks that rely on manually constructed examples, LLM-FSM is built through a fully automated pipeline. LLM-FSM first constructs FSMs with configurable state counts and constrained transition structures. It then prompts LLMs to express each FSM in a structured YAML format with an application context, and to further convert that YAML into a natural-language (NL) specification. From the same YAML, our pipeline synthesizes the reference RTL and testbench in a correct-by-construction manner. All 1,000 problems are verified using LLM-based and SAT-solver-based checks, with human review on a subset. Our experiments show that even the strongest LLMs exhibit sharply declining accuracy as FSM complexity increases. We further demonstrate that training-time scaling via supervised fine-tuning (SFT) generalizes effectively to out-of-distribution (OOD) tasks, while increasing test-time compute improves reasoning reliability. Finally, LLM-FSM remains extensible by allowing its FSM complexity to scale with future model capabilities.
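
To make the first pipeline stage concrete, the following is a minimal sketch of sampling an abstract FSM with a configurable state count and a constrained transition structure, then simulating it cycle by cycle. It is not the authors' generator: the function names (`sample_fsm`, `reachable_states`, `simulate`), the Moore-style single-bit output, and the specific constraint (input symbol 0 always advances along a ring, so every state is reachable from reset state 0) are assumptions chosen for illustration.

```python
import random
from collections import deque


def sample_fsm(num_states, num_inputs=2, seed=0):
    """Sample a deterministic Moore FSM (hypothetical format, not the paper's).

    Assumed constraint: input symbol 0 moves state s to (s + 1) % num_states,
    which guarantees every state is reachable from the reset state 0.
    All other transitions and the per-state output bits are random.
    """
    rng = random.Random(seed)
    transitions = []
    for s in range(num_states):
        row = [(s + 1) % num_states]  # input 0: advance along the ring
        row += [rng.randrange(num_states) for _ in range(num_inputs - 1)]
        transitions.append(row)
    outputs = [rng.randrange(2) for _ in range(num_states)]
    return {"transitions": transitions, "outputs": outputs}


def reachable_states(fsm, reset=0):
    """Breadth-first search over the transition table from the reset state."""
    seen, queue = {reset}, deque([reset])
    while queue:
        state = queue.popleft()
        for nxt in fsm["transitions"][state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def simulate(fsm, stimulus, reset=0):
    """Return the output observed in each cycle for a sequence of input symbols."""
    state, trace = reset, []
    for symbol in stimulus:
        trace.append(fsm["outputs"][state])
        state = fsm["transitions"][state][symbol]
    return trace


if __name__ == "__main__":
    fsm = sample_fsm(num_states=6, seed=1)
    stimulus = [0, 1, 1, 0, 1, 0, 0, 1]
    print("reachable:", sorted(reachable_states(fsm)))
    print("trace:", simulate(fsm, stimulus))
```

Comparing two such traces element by element under the same stimulus is a simulation-based check only. It can miss differences that the stimulus never exercises, which is one reason a formal equivalence check is a stronger filter.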

Paper Structure (41 sections, 4 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of the LLM-FSM data curation pipeline. The process begins by constructing an abstract FSM graph, followed by LLM-based specification generation, automatic RTL and testbench synthesis, and isomorphism/equivalence check.
  • Figure 2: Overview of the LLM-FSM evaluation pipeline. An NL specification is processed through three tool-chain settings: Specification$\rightarrow$RTL, Specification$\rightarrow$YAML$\rightarrow$RTL, and Specification$\rightarrow$SystemC. Each model prediction is executed under the same reference testbench, and correctness is determined by cycle-by-cycle output matching against the reference RTL.
  • Figure 3: An example illustrating the generation process. The abstract graph is first sampled topologically, and an LLM then assigns semantics, here producing a Quad-SPI burst-read controller for a NOR-flash device.
  • Figure 4: Scaling and difficulty analysis on LLM-FSM. Left: scaling behavior of different model families. Right: accuracy averaged across all models within each difficulty bin.
  • Figure 5: TTS for finite-state reasoning. Left: multi-trace TTS pass@k scaling on LLM-FSM. Right: comparison of single-trace TTS vs multi-trace TTS at $k=16$.
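
For readers unfamiliar with the pass@k metric named in Figure 5, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: given n sampled solutions of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The text here does not state which estimator LLM-FSM uses, so treat this as an assumption. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: total samples drawn for a problem
    c: number of those samples that passed the testbench
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the paper.
    for k in (1, 4, 16):
        print(k, pass_at_k(n=16, c=3, k=k))
```

Averaging this value over all problems gives the benchmark-level pass@k for a given k.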