Merino: Entropy-driven Design for Generative Language Models on IoT Devices
Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang
TL;DR
To address the challenge of deploying generative LLMs on IoT devices, the paper designs sub-100M-parameter language models that preserve competitive accuracy while meeting edge-device constraints. It introduces an entropy-driven framework based on the Maximum Entropy Principle: architecture design is formulated as a constrained mathematical program that maximizes transformer entropy under computational budgets and is solved with an Evolutionary Algorithm. Because the search runs on a CPU within minutes, its cost is nearly zero. A fast entropy approximation via a lookup table and an adaptive block-wise transformer with parameter sharing yield the MeRino variants intended for on-device deployment. Empirical results across fourteen NLP tasks show MeRino matching or surpassing OPT-350M with a 5.5x reduction in model size and 4.9x faster inference on NVIDIA Jetson Nano, supporting the practicality of entropy-driven design for edge devices.
Abstract
Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT model, while being 4.9x faster on NVIDIA Jetson Nano with a 5.5x reduction in model size.
