Merino: Entropy-driven Design for Generative Language Models on IoT Devices

Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang

TL;DR

To address the challenge of deploying generative LLMs on IoT devices, the paper designs sub-100M-parameter language models that preserve competitive accuracy while meeting edge-device constraints. It introduces an entropy-driven framework based on the Maximum Entropy Principle: a constrained mathematical program maximizes transformer entropy under computational budgets and is solved with an Evolutionary Algorithm, which makes the search for a target hardware nearly zero-cost. A fast entropy approximation via a lookup table and an adaptive block-wise transformer with parameter sharing yield the MeRino variants for on-device deployment. Empirical results across fourteen NLP tasks show MeRino matching or surpassing OPT-350M with a 5.5x reduction in parameter count and 4.9x faster inference on NVIDIA Jetson Nano.

Abstract

Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, scaling down LLMs for resource-constrained hardware, such as Internet-of-Things (IoT) devices, requires non-trivial effort and domain knowledge. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across fourteen NLP downstream tasks, showing their competitive performance against the state-of-the-art autoregressive transformer models under the mobile setting. Notably, compared to the 350M-parameter OPT, MeRino achieves similar or better performance on both language modeling and zero-shot learning tasks while being 4.9x faster on NVIDIA Jetson Nano with a 5.5x reduction in model size.
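
The design procedure summarized above (maximize a score under a budget, solved by an evolutionary search on the CPU) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: `proxy_score` is a hypothetical stand-in and not the paper's entropy formula, `param_count` uses the common rough estimate of about 12 * width^2 weights per decoder layer plus an embedding table with an assumed vocabulary of 50,000, and the search space (a single width and depth) is far simpler than the block-wise design the paper describes.

```python
import math
import random


def param_count(width, depth, vocab=50_000):
    # Rough decoder-only estimate (assumption): ~12 * width^2 weights per
    # layer plus a vocab * width embedding table.
    return 12 * width * width * depth + vocab * width


def proxy_score(width, depth):
    # Hypothetical stand-in objective; NOT the entropy formula from the paper.
    return depth * math.log(width)


def mutate(cfg, rng):
    # Randomly nudge either the width (in steps of 64) or the depth (by 1).
    width, depth = cfg
    if rng.random() < 0.5:
        width = max(64, width + rng.choice([-64, 64]))
    else:
        depth = max(1, depth + rng.choice([-1, 1]))
    return (width, depth)


def evolutionary_search(budget, iterations=2000, pop_size=32, seed=0):
    rng = random.Random(seed)
    start = (64, 1)  # smallest configuration in this toy search space
    if param_count(*start) > budget:
        raise ValueError("budget is too small for the smallest configuration")
    population = [start]
    for _ in range(iterations):
        child = mutate(rng.choice(population), rng)
        if param_count(*child) > budget:
            continue  # reject candidates that violate the parameter budget
        population.append(child)
        # Keep only the highest-scoring feasible candidates.
        population.sort(key=lambda c: proxy_score(*c), reverse=True)
        del population[pop_size:]
    return population[0]


best = evolutionary_search(budget=60_000_000)
print(best, param_count(*best))
```

Because no candidate is ever trained, each iteration costs only a few arithmetic operations, which is the sense in which such a search can run on a CPU within minutes.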

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, 8 tables, 2 algorithms.

Figures (6)

  • Figure 1: Average zero-shot accuracy and inference latency on NVIDIA Jetson Nano for mobile-level LLMs. Results were evaluated using lm-evaluation-harness [lm-eval] on open-sourced pre-trained models. The diameter of each circle denotes the corresponding model FLOPs.
  • Figure 2: Our entropy estimation based on table lookup is very accurate, with an average error rate of 0.03%.
  • Figure 3: Correlation between different training-free predictors, e.g., NTK [tenas], DSS-Score [transformerfree], and Decoder-Param [lts], and transformer performance (negative perplexity, higher is better). $\rho$ is Spearman's rank correlation and $\tau$ is Kendall's tau. Larger values mean higher correlation.
  • Figure 4: Our proposed adaptive block-wise transformer design. Left: the standard autoregressive transformer design, which consists of $L$ homogeneous layers. Right: the optimal architecture after entropy maximization, with $N$ transformer blocks, each having adaptive width ($\{E_i, R_i\}$) and depth ($L_i$).
  • Figure 5: Performance comparison of MeRino, NAS-based methods, and naive scaling methods.
  • ...and 1 more figures
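
The Figure 3 caption compares predictors by Spearman's $\rho$ and Kendall's $\tau$ against negative perplexity. As a minimal sketch of how these two rank correlations are computed, the pure-Python code below implements both; the sample numbers are made up for illustration and are not results from the paper, and `kendall_tau` is the simple tau-a variant, which assumes no ties.

```python
def ranks(values):
    # 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    # tau-a: (concordant pairs - discordant pairs) / total pairs.
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Made-up example: a predictor score and negative perplexity for five models.
predictor_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
negative_perplexity = [-40.0, -35.0, -36.0, -30.0, -28.0]

rho = spearman_rho(predictor_scores, negative_perplexity)
tau = kendall_tau(predictor_scores, negative_perplexity)
print(f"rho={rho:.2f}, tau={tau:.2f}")
```

In this toy data only one pair of models is ordered differently by the predictor and by negative perplexity, so both coefficients are close to 1; a predictor that ranks models the same way as the trained results would score exactly 1 on both.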