TinyLLM: A Framework for Training and Deploying Language Models at the Edge Computers
Savitha Viswanadh Kandala, Pramuka Medaranga, Ambuj Varshney
TL;DR
TinyLLM tackles the challenge of deploying language-model capabilities on edge devices by advocating for and enabling training of compact, domain-specialized models (~30–124M parameters). The framework combines curated pre-training data, GPT-2–style architectures, and LoRA fine-tuning to deliver fast, private inference on edge hardware using GGUF/llama.cpp deployments. Empirical results across gesture, localization, and sensing datasets show that small, well-curated models can match or outperform larger models while requiring far fewer GPU hours and enabling real-time edge processing. This work demonstrates a practical pathway for edge-native intelligence in embedded sensing, reducing reliance on cloud-based inference and mitigating privacy and latency concerns.
Abstract
Language models have gained significant interest due to their general-purpose capabilities, which appear to emerge as models are scaled to increasingly large parameter counts. However, these large models impose stringent requirements on computing systems, demanding significant memory and processing resources for inference. This makes inference on mobile and edge devices challenging and often requires invoking remotely hosted models via network calls. Remote inference, in turn, introduces issues such as latency, unreliable network connectivity, and privacy concerns. To address these challenges, we explore the possibility of deviating from the trend of increasing model size. Instead, we hypothesize that much smaller models (~30-120M parameters) can outperform their larger counterparts on specific tasks when the data used for pre-training and fine-tuning is carefully curated. We investigate this hypothesis in the context of deploying models on edge devices to support sensing applications. Through a systematic study, we trained several foundational models and found that small models can run locally on edge devices while achieving high token rates and accuracy. Based on these findings, we developed a framework that allows users to train foundational models tailored to their specific applications and deploy them at the edge.
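To give a sense of scale for the ~30-120M parameter range, the sketch below counts the parameters of a GPT-2-style decoder from its hyperparameters. It assumes the standard GPT-2 layout (learned position embeddings, biases on all linear layers, a 4x MLP expansion, and an output head tied to the token embedding). The function name `gpt2_param_count` and the smaller configuration are illustrative assumptions, not configurations reported by the authors.

```python
def gpt2_param_count(vocab_size: int, n_ctx: int, d_model: int, n_layer: int) -> int:
    """Count parameters of a GPT-2-style decoder with a tied output head."""
    token_emb = vocab_size * d_model      # token embedding (shared with the LM head)
    pos_emb = n_ctx * d_model             # learned position embedding

    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: expand to 4 * d_model and project back, both with biases.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    per_layer = attn + mlp + layer_norms

    final_ln = 2 * d_model                # LayerNorm after the last block
    return token_emb + pos_emb + n_layer * per_layer + final_ln


# Standard GPT-2 small hyperparameters (upper end of the range).
print(gpt2_param_count(vocab_size=50257, n_ctx=1024, d_model=768, n_layer=12))  # 124439808

# Hypothetical smaller configuration, chosen only to land near 30M.
print(gpt2_param_count(vocab_size=50257, n_ctx=1024, d_model=384, n_layer=6))   # 30339456
```

In the smaller configuration the token embedding accounts for roughly two thirds of all parameters, so at this scale the vocabulary size matters about as much as depth or width.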
