EmbedAgent: Benchmarking Large Language Models in Embedded System Development

Ruiyang Xu, Jialun Cao, Mingyuan Wu, Wenliang Zhong, Yaojie Lu, Ben He, Xianpei Han, Shing-Chi Cheung, Le Sun

TL;DR

This work introduces EmbedAgent and EmbedBench to evaluate large language models on end-to-end embedded system development, covering programming, circuit design, and cross-platform migration. Using a hardware-driven, automated evaluation in the Wokwi simulator, the authors test 10 LLMs across 126 tasks on three platforms (Arduino, ESP32, Raspberry Pi Pico). Even strong models struggle: DeepSeek-R1 reaches only 55.6% pass@1 when given the schematic, and models falter in particular on ESP32 migrations. To address these gaps, the authors propose retrieval-augmented generation (R1-Retrieval) and compiler feedback (R1-Compiler), which yield notable gains (e.g., DeepSeek-R1 reaches 65.1% pass@1 with correct schematics, and Arduino-to-ESP32 migration improves from 21.4% to 27.8%). The findings highlight both the current limits and the potential of LLMs for embodied hardware tasks, and they offer practical strategies and a reproducible benchmark to advance research in embedded-system AI.

Abstract

Large Language Models (LLMs) have shown promise in various tasks, yet few benchmarks assess their capabilities in embedded system development. In this paper, we introduce EmbedAgent, a paradigm designed to simulate real-world roles in embedded system development, such as Embedded System Programmer, Architect, and Integrator. This paradigm enables LLMs to be tested in tasks that bridge the gap between digital and physical systems, allowing for a more comprehensive assessment of their capabilities. To evaluate LLMs on these tasks, we propose Embedbench, the first comprehensive benchmark for embedded system programming, circuit design, and cross-platform migration. Embedbench consists of 126 cases, covering 9 electronic components across 3 hardware platforms. Through extensive experiments on 10 mainstream LLMs, we uncover several key findings. Surprisingly, despite the simplicity of the cases, DeepSeek-R1 achieves only a 55.6% pass@1 rate when provided with schematic information, and 50.0% when tasked with generating the schematics itself. In the cross-platform migration tasks, LLMs show relatively strong performance with MicroPython on the Raspberry Pi Pico (with the top model achieving 73.8% pass@1), but perform poorly on ESP-IDF, where the best model reaches only 29.4% pass@1. Interestingly, we observe that general-purpose chat LLMs like DeepSeek-V3 often fail to utilize relevant pre-trained knowledge in this domain, while reasoning LLMs tend to overthink and overlook efficient knowledge acquired during pretraining. Based on these insights, we propose two strategies, retrieval-augmented generation and compiler feedback, to enhance LLM performance. These strategies result in significant improvements, with DeepSeek-R1 reaching 65.1% pass@1 with correct schematics, and 53.1% without. Additionally, the accuracy of the Arduino-to-ESP32 migration task improves from 21.4% to 27.8%.
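
The abstract reports every result as pass@1. As a rough illustration of how such a figure might be computed, the sketch below uses the standard unbiased pass@k estimator, which reduces to the fraction of correct samples when k = 1. The function names and the count of 70 passing tasks are illustrative assumptions, not taken from the paper; the paper's exact sampling protocol is not described in this section.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: list[bool]) -> float:
    """Benchmark-level pass@1 when each task is attempted once (assumed protocol)."""
    if not passed:
        raise ValueError("need at least one task result")
    return sum(passed) / len(passed)


if __name__ == "__main__":
    # Hypothetical outcome: 70 of 126 tasks pass. The count 70 is an assumption
    # chosen for illustration, not a number reported by the authors.
    outcomes = [True] * 70 + [False] * 56
    print(f"pass@1 = {benchmark_pass_at_1(outcomes):.1%}")
```

With a single attempt per task, `pass_at_k(1, c, 1)` is simply 1.0 for a passing task and 0.0 for a failing one, so averaging it over all tasks gives the same value as `benchmark_pass_at_1`.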

Paper Structure

This paper contains 26 sections, 3 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Three Settings of EmbedAgent: ➊ Embedded System Programmer: Given the task description and the circuit schematic, LLMs are expected to write Arduino code. ➋ Embedded System Architect: Given the task description, LLMs are expected to design the circuit and write the code. ➌ Embedded System Integrator: Given the circuit schematic and code for one hardware platform, LLMs are expected to migrate the circuit design and code to another hardware platform.
  • Figure 2: Arduino Workflow. The workflow requires a combination of hardware and software. On the hardware side, an Arduino board, electronic components, and a circuit schematic are needed. On the software side, code must be written based on the schematic, uploaded to the board via a USB port and a USB-to-serial chip, and finally processed by the microcontroller.
  • Figure 3: Illustration of Wokwi, a virtual circuit simulator. The upper part (i.e., Diagram) shows how the virtual circuit is represented in Wokwi; the lower part (i.e., Code) shows the Arduino code that drives the virtual circuit. Once the virtual circuit is activated, it is virtually powered. According to the Arduino code, the number shown on the 7-segment display increases each time the button is pressed.
  • Figure 4: Data Preparation Pipeline
  • Figure 5: The average accuracy of reasoning LLMs (QwQ, DeepSeek-R1, O3-mini, Claude 3.7 Sonnet (Thinking)) on problems involving specific electronic components.
  • ...and 2 more figures