TinyLLM: A Framework for Training and Deploying Language Models at the Edge Computers

Savitha Viswanadh Kandala, Pramuka Medaranga, Ambuj Varshney

TL;DR

TinyLLM tackles the challenge of deploying language-model capabilities on edge devices by advocating for and enabling training of compact, domain-specialized models (~30–124M parameters). The framework combines curated pre-training data, GPT-2–style architectures, and LoRA fine-tuning to deliver fast, private inference on edge hardware using GGUF/llama.cpp deployments. Empirical results across gesture, localization, and sensing datasets show that small, well-curated models can match or outperform larger models while requiring far fewer GPU hours and enabling real-time edge processing. This work demonstrates a practical pathway for edge-native intelligence in embedded sensing, reducing reliance on cloud-based inference and mitigating privacy and latency concerns.
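To give a feel for what "~30–124M parameters" means for a GPT-2–style model, the sketch below counts parameters from the layer count and hidden width. It assumes the standard GPT-2 layout (learned position embeddings, a 4x-wide MLP, biases on all linear layers, output head tied to the token embedding) and the GPT-2 vocabulary size of 50257. The 12-layer, 768-wide setting is the published GPT-2 small configuration; the 6-layer, 384-wide setting is a hypothetical example chosen only to land near 30M and is not taken from the paper.

```python
def gpt2_param_count(n_layer: int, d_model: int,
                     vocab_size: int = 50257, n_ctx: int = 1024) -> int:
    """Count parameters of a GPT-2-style decoder with tied output embeddings."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Per block: attention QKV (3d^2 + 3d), attention output projection (d^2 + d),
    # MLP up and down projections (8d^2 + 5d), and two LayerNorms (4d).
    per_block = 12 * d_model ** 2 + 13 * d_model
    # Final LayerNorm (scale and bias).
    final_ln = 2 * d_model
    return embeddings + n_layer * per_block + final_ln


if __name__ == "__main__":
    # GPT-2 small configuration.
    print(gpt2_param_count(n_layer=12, d_model=768))  # 124439808
    # Hypothetical smaller configuration (an assumption, not from the paper).
    print(gpt2_param_count(n_layer=6, d_model=384))   # 30339456
```

At the small end of this range, the token embedding table accounts for most of the parameters, so vocabulary size matters as much as depth when shrinking a model of this style.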

Abstract

Language models have gained significant interest due to their general-purpose capabilities, which appear to emerge as models are scaled to increasingly larger parameter sizes. However, these large models impose stringent requirements on computing systems, demanding significant memory and processing resources for inference. This makes performing inference on mobile and edge devices challenging, often requiring remotely hosted models to be invoked via network calls. Remote inference, in turn, introduces issues like latency, unreliable network connectivity, and privacy concerns. To address these challenges, we explored the possibility of deviating from the trend of increasing model size. Instead, we hypothesize that much smaller models (~30-120M parameters) can outperform their larger counterparts for specific tasks by carefully curating the data used for pre-training and fine-tuning. We investigate this within the context of deploying edge-device models to support sensing applications. We trained several foundational models through a systematic study and found that small models can run locally on edge devices, achieving high token rates and accuracy. Based on these findings, we developed a framework that allows users to train foundational models tailored to their specific applications and deploy them at the edge.

Paper Structure

This paper contains 19 sections, 2 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: An embedded application often involves sensors that collect environmental data, which is then communicated to an edge device. TinyLLM provides a framework for training foundational models tailored for edge deployment, enabling these models to support a variety of tasks. This work explores training custom foundational models to enhance sensor data analysis. Our approach yields a model with significantly fewer parameters than state-of-the-art language models, facilitating high-accuracy sensor data analysis while enabling rapid, local inference even on a constrained edge platform.
  • Figure 2: TinyLLM trains a custom foundational model for deployment on the edge device in a series of steps. It begins by combining a curated dataset with general conversational data. After pre-processing, the dataset is tokenized and used to pre-train a small model (30-120M parameters). The pre-trained model is then fine-tuned on the custom dataset before deployment on the edge device to support embedded applications.
  • Figure 3: Processing the dataset is essential for effective pre-training. This step addresses the challenges posed by the dataset’s diversity, ensures the dataset fits within the model’s context window size, and formats the data appropriately for use in the subsequent training process.
  • Figure 4: A high-level representation of the model architecture used in this work, which is based on GPT-2. The model architecture consists of l transformer blocks
  • Figure 5: We borrow a template from Alpaca for prompts and dataset entries required for fine-tuning a pre-trained model. Fine-tuning is an important step to ensure accurate responses to user queries for the specific application scenario.
  • ...and 12 more figures
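
Two of the steps described in the captions above, fitting data to the context window (Figure 3) and formatting fine-tuning entries with an Alpaca-style template (Figure 5), can be sketched as follows. This is a minimal illustration, not the authors' code: the template strings are the ones published with Stanford Alpaca, while the function names, the non-overlapping chunking strategy, and the sample gesture record are assumptions made for the example.

```python
from typing import List, Optional

# Prompt templates as published with Stanford Alpaca.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def chunk_tokens(tokens: List[int], context_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most context_size.

    Hypothetical helper: the paper may use a different strategy
    (for example overlapping windows or padding).
    """
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    return [tokens[i:i + context_size]
            for i in range(0, len(tokens), context_size)]


def format_alpaca(instruction: str, response: str,
                  input_text: Optional[str] = None) -> str:
    """Build one fine-tuning example: Alpaca prompt followed by the target."""
    if input_text:
        prompt = ALPACA_WITH_INPUT.format(instruction=instruction,
                                          input=input_text)
    else:
        prompt = ALPACA_NO_INPUT.format(instruction=instruction)
    return prompt + response


if __name__ == "__main__":
    # Made-up sensor record, for illustration only.
    example = format_alpaca(
        instruction="Classify the gesture from the accelerometer reading.",
        input_text="ax=0.12, ay=-0.98, az=0.05",
        response="swipe_left",
    )
    print(example)
    print(chunk_tokens(list(range(10)), context_size=4))
```

At inference time the same template would be used without the response appended, so that the model generates the text after "### Response:".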