Merino: Entropy-driven Design for Generative Language Models on IoT Devices

Youpeng Zhao; Ming Lin; Huadong Tang; Qiang Wu; Jun Wang

TL;DR

To deploy generative LLMs on IoT devices, the paper designs sub-100M-parameter language models that remain competitive in accuracy while meeting edge-device constraints. It introduces an entropy-driven framework based on the Maximum Entropy Principle: architecture design is formulated as a constrained mathematical program that maximizes transformer entropy under resource budgets and is solved with an Evolutionary Algorithm, making the search nearly zero-cost. A fast entropy approximation via a lookup table and an adaptive block-wise transformer with parameter sharing yield the MeRino variants for on-device deployment. Across fourteen NLP tasks, MeRino matches or surpasses OPT-350M with 5.5x fewer parameters and 4.9x faster inference on NVIDIA Jetson Nano.

Abstract

Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on a CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks compared to the 350M-parameter OPT, while being 4.9x faster on NVIDIA Jetson Nano with a 5.5x reduction in model size.
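
The budget-constrained search summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the candidate widths, FFN ratios, and depths are hypothetical; `param_count` is a rough per-layer estimate that ignores embeddings and biases; and `proxy_score` is a simple placeholder, not the paper's entropy formula. The sketch only shows the general shape of an evolutionary search that maximizes a training-free score under a parameter budget.

```python
import math
import random

# Hypothetical search space: each block is (E, R, L) = (width, FFN ratio, depth).
WIDTHS = [256, 384, 512, 640, 768]
RATIOS = [1.0, 2.0, 4.0]
DEPTHS = [1, 2, 3, 4]


def param_count(arch):
    """Rough estimate: 4*E^2 attention weights + 2*R*E^2 FFN weights per layer."""
    return sum(L * (4 * E * E + 2 * R * E * E) for E, R, L in arch)


def proxy_score(arch):
    """Placeholder training-free score (NOT the paper's entropy definition)."""
    return sum(L * math.log(E * R) for E, R, L in arch)


def mutate(arch, rng):
    """Resample one randomly chosen block from the search space."""
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = (rng.choice(WIDTHS), rng.choice(RATIOS), rng.choice(DEPTHS))
    return tuple(child)


def evolutionary_search(num_blocks, budget, iters=2000, pop_size=32, seed=0):
    """Keep the highest-scoring architectures that fit the parameter budget."""
    rng = random.Random(seed)
    smallest = tuple(
        (min(WIDTHS), min(RATIOS), min(DEPTHS)) for _ in range(num_blocks)
    )
    if param_count(smallest) > budget:
        raise ValueError("budget is too small for the search space")
    population = [smallest]
    for _ in range(iters):
        child = mutate(rng.choice(population), rng)
        if param_count(child) > budget:
            continue  # discard candidates that violate the budget
        population.append(child)
        if len(population) > pop_size:
            population.sort(key=proxy_score, reverse=True)
            population = population[:pop_size]
    return max(population, key=proxy_score)


if __name__ == "__main__":
    best = evolutionary_search(num_blocks=4, budget=60_000_000)
    print(best, param_count(best), round(proxy_score(best), 2))
```

Because every candidate is scored without any training, each iteration costs only a few arithmetic operations, which is the property that lets a search of this kind finish on a CPU in minutes.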

Paper Structure (18 sections, 12 equations, 6 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Methodology
Preliminaries
Entropy of Neural Network
Effectiveness of Neural Network
Entropy of Transformers
Designing Mobile Language Models
Experiments
Experimental Settings
Main Results
Ablation Study
Conclusion
Acknowledgment
Appendix
...and 3 more sections

Figures (6)

Figure 1: Average zero-shot accuracy and inference latency on NVIDIA Jetson Nano for mobile-level LLMs. Results were evaluated using lm-evaluation-harness [lm-eval] on open-sourced pre-trained models. The diameter of each circle denotes the corresponding model FLOPs.
Figure 2: Our entropy estimation based on table lookup is very accurate, with an average error rate of 0.03%.
Figure 3: Correlation between different training-free predictors, e.g., NTK [tenas], DSS-Score [transformerfree], and Decoder-Param [lts], and transformer performance (negative perplexity, higher is better). $\rho$ is Spearman's rank correlation and $\tau$ is Kendall's tau. Larger values mean higher correlation.
Figure 4: Our proposed adaptive block-wise transformer design. Left: the standard autoregressive transformer design, which consists of $L$ homogeneous layers. Right: the optimal architecture design after entropy maximization, with $N$ transformer blocks, each having adaptive width ($\{E_i, R_i\}$) and depth ($L_i$).
Figure 5: Performance comparison of MeRino, NAS-based methods, and naive scaling methods.
...and 1 more figures
