Sim-is-More: Randomizing HW-NAS with Synthetic Devices
Francesco Capuano, Gabriele Tiboni, Niccolò Cavagnero, Giuseppe Averta
TL;DR
This paper tackles multi-device hardware-aware NAS without relying on pre-deployment latency models. It introduces a two-stage framework in which a controller is first trained on a distribution of synthetic devices and then deployed on a real target device, adapting from only a few high-fidelity latency measurements. The approach combines training-free accuracy proxies with domain randomization to generalize across devices while keeping test-time costs low, and is demonstrated on the NATS-Bench space with as few as 10 real-world latency probes. By avoiding latency predictors and lookup-table (LUT) estimates, the method sidesteps their estimation errors and offers a scalable path for deploying latency-efficient architectures across diverse hardware platforms.
Abstract
Existing hardware-aware NAS (HW-NAS) methods typically assume access to precise information about the target device, either via analytical approximations of the post-compilation latency model or through learned latency predictors. Such approximate approaches risk introducing estimation errors that may prove detrimental in risk-sensitive applications. In this work, we propose a two-stage HW-NAS framework, in which we first learn an architecture controller on a distribution of synthetic devices, and then directly deploy the controller on a target device. At test time, our controller does not rely on any pre-collected information about the target device and exploits only direct interactions with it. In particular, the pre-training phase on synthetic devices enables the controller to design an architecture for the target device by interacting with it through a small number of high-fidelity latency measurements. To keep our method accessible, we train our controller only with training-free accuracy proxies, allowing us to scale the meta-training phase without incurring the overhead of full network training. We benchmark on HW-NATS-Bench, demonstrating that our method generalizes to unseen devices and finds latency-efficient architectures through in-context adaptation, using only a few real-world latency evaluations at test time.
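To make the interaction pattern concrete, the following is a minimal toy sketch in Python, not the authors' method. It assumes a hypothetical four-operation search space, a hypothetical synthetic-device generator (random per-operation costs plus a fixed overhead), and a stand-in accuracy proxy. The learned controller described in the abstract is replaced here by a simple heuristic that ranks candidates by proxy score and spends a fixed budget of direct latency measurements on the device. All names (`OPS`, `sample_synthetic_device`, `proxy_score`, `search_with_budget`) are illustrative assumptions.

```python
import random

# Hypothetical toy search space: an architecture is a tuple of NUM_EDGES ops.
OPS = ["skip", "conv1x1", "conv3x3", "avgpool"]
NUM_EDGES = 6


def sample_synthetic_device(rng):
    """Draw a toy synthetic device (an assumption, not the paper's generator).

    The device is returned as a function mapping an architecture to a latency:
    a fixed overhead plus a randomly drawn cost for each operation.
    """
    costs = {op: rng.uniform(0.1, 2.0) for op in OPS}
    costs["skip"] = rng.uniform(0.0, 0.1)  # skip connections are nearly free
    overhead = rng.uniform(0.5, 1.5)
    return lambda arch: overhead + sum(costs[op] for op in arch)


def proxy_score(arch):
    """Stand-in for a training-free accuracy proxy (purely illustrative)."""
    weights = {"skip": 0.0, "avgpool": 0.2, "conv1x1": 0.6, "conv3x3": 1.0}
    return sum(weights[op] for op in arch)


def search_with_budget(measure_latency, latency_limit, budget, rng, n_candidates=200):
    """Probe the device at most `budget` times, in order of decreasing proxy score.

    Returns (architecture, latency, probes_used) for the first candidate that
    meets the latency limit, or (None, None, probes_used) if none does.
    """
    candidates = [
        tuple(rng.choice(OPS) for _ in range(NUM_EDGES)) for _ in range(n_candidates)
    ]
    candidates.sort(key=proxy_score, reverse=True)
    probed = candidates[:budget]
    for probes, arch in enumerate(probed, start=1):
        latency = measure_latency(arch)  # one direct measurement on the device
        if latency <= latency_limit:
            return arch, latency, probes
    return None, None, len(probed)


if __name__ == "__main__":
    rng = random.Random(0)
    # Randomize over several synthetic devices, each with a budget of 10 probes.
    for _ in range(3):
        device = sample_synthetic_device(rng)
        arch, latency, probes = search_with_budget(
            device, latency_limit=5.0, budget=10, rng=rng
        )
        print(arch, latency, probes)
```

In this toy setting the device is only ever queried through `measure_latency`, so the search never sees the per-operation costs. This mirrors the abstract's constraint that the controller uses direct interactions rather than a pre-collected latency model, while leaving out the meta-training stage entirely.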
