A WASM-Subset Stack Architecture for Low-cost FPGAs using Open-Source EDA Flows

Aradhya Chakrabarti

TL;DR

This work demonstrates a 32-bit WASM-like dual-stack soft-core implemented on a low-cost FPGA (Gowin GW1NR-9) using an open-source EDA flow and SPI Flash Execute-in-Place. It shows that distributed RAM can efficiently back small stacks, and introduces a 12-state FSM plus an ALU_WAIT mechanism to resolve race conditions, achieving about 27 MHz and 4–6 MIPS. Key contributions include a WASM-like ISA with a compact encoding, a two-pass JavaScript assembler, and a resource/stack-depth analysis that identifies 8-entry stacks as optimal for the target device. The case study with an infix calculator validates end-to-end functionality and emphasizes the practicality of open, transparent tooling for FPGA soft-cores in constrained environments.

Abstract

Soft-core processors on resource-constrained FPGAs often suffer from low code density and reliance on proprietary toolchains. This paper details the design, implementation, and evaluation of a 32-bit dual-stack microprocessor architecture optimized for low-cost, resource-constrained Field-Programmable Gate Arrays (FPGAs). Implemented on the Gowin GW1NR-9 (Tang Nano 9K), the processor uses an instruction set architecture (ISA) inspired by a subset of the WebAssembly (WASM) specification to achieve high code density. Unlike traditional soft-cores that often rely on proprietary vendor toolchains and opaque IP blocks, this design is synthesized and routed with an open-source flow, providing transparency and portability. The architecture features a dual-stack model (Data and Return) and executes directly from SPI Flash via an Execute-in-Place (XIP) mechanism to conserve scarce Block RAM on the target device. An analysis of the trade-offs involved in stack depth parametrization is presented, demonstrating that an 8-entry distributed RAM implementation balances logic resource utilization ($\sim 80\%$) against routing congestion. Furthermore, timing hazards in single-cycle stack operations are identified and resolved through a refined Finite State Machine (FSM) design. The system achieves a stable operating frequency of 27 MHz, limited by Flash latency, and successfully executes simple applications, including single-digit and multi-digit infix calculators.
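
As an illustration of the data-stack behaviour the abstract describes, the following is a minimal software sketch of an 8-entry, 32-bit stack executing a PUSH followed by an ADD. It is not the paper's hardware description: the class name `Stack`, the helper `run`, the tuple-based program format, and the choice to raise an error on overflow or underflow are all assumptions made for illustration. Only the depth of 8, the 32-bit width, and the PUSH and ADD mnemonics come from the text.

```python
DEPTH = 8            # stack depth the abstract identifies as the balanced choice
MASK32 = 0xFFFFFFFF  # values are kept to 32 bits, matching the 32-bit core


class Stack:
    """Fixed-depth stack modelled as a small RAM plus a stack pointer."""

    def __init__(self, depth=DEPTH):
        self.mem = [0] * depth
        self.sp = 0  # number of valid entries, also the next free slot

    def push(self, value):
        # Assumption: overflow is reported as an error. Real hardware
        # might instead wrap or silently drop the value.
        if self.sp == len(self.mem):
            raise OverflowError("stack overflow")
        self.mem[self.sp] = value & MASK32
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("stack underflow")
        self.sp -= 1
        return self.mem[self.sp]


def run(program):
    """Execute a list of (mnemonic, operand...) tuples on a fresh data stack."""
    data = Stack()
    for op, *args in program:
        if op == "PUSH":
            data.push(args[0])
        elif op == "ADD":
            b = data.pop()
            a = data.pop()
            data.push(a + b)  # push() truncates the sum to 32 bits
        else:
            raise ValueError(f"unknown mnemonic: {op}")
    return data


data = run([("PUSH", 2), ("PUSH", 3), ("ADD",)])
print(data.sp, data.mem[0])
```

A second `Stack` instance would play the role of the return stack in this model. How the hardware handles full or empty stacks is not stated in this section, so the error handling above should be read as a placeholder.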

Paper Structure

This paper contains 32 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: System Architecture utilizing the Tang Nano 9K resources. The CPU implements a Harvard-like split with instructions in Flash and data in SRAM.
  • Figure 2: Simplified FSM describing the fetch-decode-execute loop.
  • Figure 3: Example visualization of the Data Stack during a PUSH 3 followed by an ADD operation. The Stack Pointer (SP) moves up and down as data is pushed and popped.
  • Figure 4: UART transmission timing diagram for the character 'A' (0x41) at 115200 baud. The line is driven low for the start bit and for each data bit that is logic 0.
  • Figure 5: Sample single-digit calculator execution output
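
The UART framing summarized in the Figure 4 caption can be sketched in a few lines. The sketch assumes the common 8N1 format (one start bit, eight data bits sent least-significant bit first, no parity, one stop bit), which the caption does not state explicitly. It also assumes the baud generator is clocked at the 27 MHz operating frequency quoted in the abstract. The function names `uart_frame` and `clocks_per_bit` are hypothetical.

```python
def uart_frame(byte):
    """Return the line levels for one byte, assuming 8N1 framing, LSB first."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]  # start bit is low, stop bit is high


def clocks_per_bit(clock_hz, baud):
    """Nearest whole number of clock cycles per bit period."""
    return round(clock_hz / baud)


frame = uart_frame(ord("A"))          # 'A' is 0x41
divider = clocks_per_bit(27_000_000, 115200)
print(frame)
print(divider)
```

Because 27 MHz is not an exact multiple of 115200, a whole-number divider leaves a small baud-rate error, which asynchronous receivers normally tolerate.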