H2LooP Spark Preview: Continual Pretraining of Large Language Models for Low-Level Embedded Systems Code

Amit Singh, Vedant Nipane, Pulkit Agrawal, Jatin Kishnani

Abstract

Large language models (LLMs) demonstrate strong code generation abilities in general-purpose programming languages but remain limited in specialized domains such as low-level embedded systems programming. This domain involves hardware register manipulation, vendor-specific SDKs, real-time operating system APIs, and hardware abstraction layers that are underrepresented in standard pretraining corpora. We introduce H2LooP Spark Preview, a continual pretraining (CPT) pipeline that adapts OLMo-3-7B, a fully open language model, to the embedded systems domain using BF16 LoRA with rank-stabilized scaling on 8 NVIDIA H100 GPUs. Our training corpus is constructed from repository-datasheet pairs covering 100B tokens of raw embedded systems data across 117 manufacturers, processed using the hierarchical datasheet-to-code mapping approach proposed in SpecMap (Nipane et al., 2026). The resulting curated dataset split contains 23.5B tokens across 13 embedded domains. Continual pretraining with high-rank LoRA (r=512) yields substantial gains, reducing in-domain perplexity by 70.4% and held-out repository perplexity by 66.1%. On generative code completion benchmarks spanning 13 embedded domains, our 7B model outperforms Claude Opus 4.6 and Qwen3-Coder-30B on 8 categories in token accuracy, showing that targeted continual pretraining enables smaller open-weight models to rival frontier systems on specialized technical tasks. We release the production training checkpoint on Hugging Face as an open-source artifact.
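
To put the reported perplexity reductions in terms of training loss, the short sketch below converts a relative perplexity reduction into the implied drop in mean per-token cross-entropy. It assumes the standard definition, perplexity = exp(mean cross-entropy in nats), which the abstract does not state explicitly. The function name `loss_drop_from_ppl_reduction` is illustrative and does not come from the paper.

```python
import math


def loss_drop_from_ppl_reduction(reduction_pct: float) -> float:
    """Return the cross-entropy drop (nats per token) implied by a
    relative perplexity reduction, assuming ppl = exp(mean cross-entropy)."""
    ratio = 1.0 - reduction_pct / 100.0  # ppl_after / ppl_before
    if not 0.0 < ratio <= 1.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    # ppl_after / ppl_before = exp(loss_after - loss_before)
    return -math.log(ratio)


# Percentages quoted in the abstract.
in_domain_drop = loss_drop_from_ppl_reduction(70.4)
held_out_drop = loss_drop_from_ppl_reduction(66.1)

print(f"in-domain loss drop: {in_domain_drop:.3f} nats/token")
print(f"held-out loss drop:  {held_out_drop:.3f} nats/token")
```

Under that assumption, the two reductions correspond to drops of roughly 1.2 and 1.1 nats per token, respectively.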

Paper Structure (51 sections, 11 equations, 10 figures, 22 tables)

Figures (10)

  • Figure 1: Dataset Composition. Left: Training pool category distribution across 818 pairs. Right: Top 10 manufacturers by pair count.
  • Figure 2: Bayesian Hyperparameter Sweep. Training dynamics for 10 configurations. Each curve represents a unique (rank, target, LR) triple. Top-left: Training loss stratified by rank. Top-center: Learning rate schedules. Top-right: Post-clip gradient norms. Bottom: Global step and epoch progression. Labels encode rank ($r$), target (A = attention-only, F = full), and learning rate.
  • Figure 3: Training Loss. Training cross-entropy loss as a function of optimizer steps for the hero run. The raw loss (light trace) and smoothed loss (bold trace, window=20) both decrease monotonically from 0.88 to 0.25, with no instabilities throughout the entire 294.7-hour run.
  • Figure 4: Hero Run LR Schedule and Token Accuracy. Left: Cosine LR schedule with 10% warmup. Right: Mean token accuracy climbing monotonically from 86% to 93.5%.
  • Figure 5: Throughput. Left: Tokens per second, showing a sharp transition from $\sim$11,000 tok/s (steps 0-6,000) to $\sim$7,000 tok/s (steps 7,000+). Right: Step time, showing the corresponding increase from $\sim$45s to $\sim$70s per step.
  • ...and 5 more figures
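
The approximate throughput and step-time values quoted in the Figure 5 caption can be cross-checked against each other, since tokens per step should equal tokens per second times seconds per step. The sketch below does that arithmetic using only the rounded caption values; the helper name `tokens_per_step` is illustrative, and the results are only as precise as the "$\sim$" figures they come from.

```python
def tokens_per_step(tokens_per_second: float, seconds_per_step: float) -> float:
    """Tokens processed in one optimizer step at a given throughput."""
    return tokens_per_second * seconds_per_step


# Rounded values from the Figure 5 caption.
early = tokens_per_step(11_000, 45)  # steps 0-6,000
late = tokens_per_step(7_000, 70)    # steps 7,000+

# Relative difference between the two regimes.
relative_gap = abs(early - late) / early

print(f"early regime: {early:,.0f} tokens/step")
print(f"late regime:  {late:,.0f} tokens/step")
print(f"relative gap: {relative_gap:.1%}")
```

Both regimes imply about the same number of tokens per step (within a few percent), so the two panels of the figure are mutually consistent at the precision of the caption.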