H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code

Amit Singh; Vedant Nipane; Pulkit Agrawal; Jatin Kishnani

Abstract

Large language models (LLMs) demonstrate strong code generation abilities in general-purpose programming languages but remain limited in specialized domains such as low-level embedded systems programming. This domain involves hardware register manipulation, vendor-specific SDKs, real-time operating system APIs, and hardware abstraction layers that are underrepresented in standard pretraining corpora. We introduce H2LooP Spark Preview, a continual pretraining (CPT) pipeline that adapts OLMo-3-7B, a fully open language model, to the embedded systems domain using BF16 LoRA with rank-stabilized scaling on 8 NVIDIA H100 GPUs. Our training corpus is constructed from repository-datasheet pairs covering 100B tokens of raw embedded systems data across 117 manufacturers, processed using the hierarchical datasheet-to-code mapping approach proposed in SpecMap (Nipane et al., 2026). The resulting curated dataset split contains 23.5B tokens across 13 embedded domains. Continual pretraining with high-rank LoRA (r=512) yields substantial gains, reducing in-domain perplexity by 70.4% and held-out repository perplexity by 66.1%. On generative code completion benchmarks spanning 13 embedded domains, our 7B model outperforms Claude Opus 4.6 and Qwen3-Coder-30B on 8 categories in token accuracy, showing that targeted continual pretraining enables smaller open-weight models to rival frontier systems on specialized technical tasks. We release the production training checkpoint on Hugging Face as an open-source artifact.
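To make the "rank-stabilized scaling" mentioned in the abstract concrete, the sketch below compares the multiplier applied to the low-rank update under standard LoRA (alpha / r) and under rank-stabilized LoRA (alpha / sqrt(r)). The abstract gives only the rank (r=512); the value alpha=16 and the function name `lora_scale` are illustrative assumptions, not settings or code from the paper.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = True) -> float:
    """Return the multiplier applied to the low-rank update B @ A.

    Standard LoRA scales the update by alpha / rank; rank-stabilized
    LoRA scales it by alpha / sqrt(rank) instead.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    if rank_stabilized:
        return alpha / math.sqrt(rank)
    return alpha / rank


if __name__ == "__main__":
    alpha = 16.0  # hypothetical value; the abstract does not state alpha
    rank = 512    # rank reported in the abstract
    print("standard LoRA scale:  ", lora_scale(alpha, rank, rank_stabilized=False))
    print("rank-stabilized scale:", lora_scale(alpha, rank, rank_stabilized=True))
```

Under these assumed values the standard multiplier shrinks in proportion to 1/r, while the rank-stabilized one shrinks only as 1/sqrt(r), which is the usual motivation for using it at high ranks such as 512.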

Paper Structure (51 sections, 11 equations, 10 figures, 22 tables)

Introduction
Related Work
Continual Pretraining for Domain Adaptation
Parameter-Efficient Fine-Tuning
Code Language Models
Datasheet-to-Code Mapping for Training Data
Embedded Systems AI
Training Data
Data Sources and the Model Training Pool
Data Processing Pipeline
Corpus Statistics
Evaluation Data
Model and Training Methodology
Base Model
Parameter-Efficient Fine-Tuning Configuration
...and 36 more sections

Figures (10)

Figure 1: Dataset Composition. Left: Training pool category distribution across 818 pairs. Right: Top 10 manufacturers by pair count.
Figure 2: Bayesian Hyperparameter Sweep. Training dynamics for 10 configurations. Each curve represents a unique (rank, target, LR) triple. Top-left: Training loss stratified by rank. Top-center: Learning rate schedules. Top-right: Post-clip gradient norms. Bottom: Global step and epoch progression. Labels encode rank ($r$), target (A = attention-only, F = full), and learning rate.
Figure 3: Training Loss. Training cross-entropy loss as a function of optimizer steps for the hero run. The raw loss (light trace) and smoothed loss (bold trace, window=20) both exhibit monotonically decreasing behavior from 0.88 to 0.25 with no instabilities throughout the entire 294.7-hour run.
Figure 4: Hero Run LR Schedule and Token Accuracy. Left: Cosine LR schedule with 10% warmup. Right: Mean token accuracy climbing monotonically from 86% to 93.5%.
Figure 5: Throughput. Left: Tokens per second, showing a sharp transition from $\sim$11,000 tok/s (steps 0-6,000) to $\sim$7,000 tok/s (steps 7,000+). Right: Step time, showing the corresponding increase from $\sim$45s to $\sim$70s per step.
...and 5 more figures
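
The Figure 4 caption describes a cosine learning-rate schedule with 10% warmup. A minimal sketch of such a schedule follows. The linear warmup shape, the decay to zero, and the values `total_steps=1000` and `peak_lr=1e-4` are assumptions for illustration; the captions do not state them.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.10, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step.

    Linear warmup over the first `warmup_frac` of training (assumed shape),
    followed by cosine decay from `peak_lr` down to `min_lr`.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps, peak_lr = 1000, 1e-4  # hypothetical values
    for step in (0, 99, 100, 550, 1000):
        print(step, lr_at_step(step, total_steps, peak_lr))
```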
