LLMs on a Budget? Say HOLA

Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, Usman Naseem

TL;DR

HOLA targets the bottleneck of deploying LLMs on edge devices by unifying internal decoding acceleration, adaptive external retrieval, and memory-efficient compression in a single pipeline. It introduces three components: Hierarchical Speculative Decoding (HSD), AdaComp-RAG, and Lo-Bi Optimization. These are formalized, respectively, through the decoding rule $y_t=\hat{y}_t$ if $g(t)=1$ and $y_t=f_{ver}(\mathbf{x},\hat{y}_{<t})$ otherwise, a retrieval gate $C(\mathbf{q})=\left\|\nabla_{\mathbf{q}}\mathcal{L}\right\|_2$, and LoRA-based updates with mixed-precision quantization. Experiments on GSM8K and ARC show substantial gains in Exact Match Accuracy and Multiple-Choice Accuracy, along with notable reductions in latency and memory across diverse models and hardware, from edge devices such as Jetson Nano and Raspberry Pi to cloud GPUs such as NVIDIA A100. These findings indicate HOLA's potential to enable production-ready, real-time LLMs in healthcare, education, and embedded systems by balancing performance, resource use, and deployment practicality.
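
The decoding rule and the retrieval gate quoted in the TL;DR can be illustrated with a minimal Python sketch. This is not the authors' implementation. The callables `draft`, `verify`, and `gate` are hypothetical stand-ins for the draft model, the verifier $f_{ver}$, and the gate $g(t)$. The threshold `tau` is also an assumption, since the summary gives only the score $C(\mathbf{q})$ and not how it is thresholded.

```python
import math
from typing import Callable, List, Sequence

Token = int


def gated_decode(
    x: Sequence[Token],
    draft: Callable[[Sequence[Token], Sequence[Token]], Token],
    verify: Callable[[Sequence[Token], Sequence[Token]], Token],
    gate: Callable[[int], int],
    steps: int,
) -> List[Token]:
    """Apply y_t = y_hat_t if g(t) == 1, else f_ver(x, y_hat_{<t})."""
    drafts: List[Token] = []  # draft prefix, i.e. y_hat_{<t}
    output: List[Token] = []  # emitted tokens y_t
    for t in range(steps):
        y_hat = draft(x, drafts)      # cheap proposal (hypothetical draft model)
        if gate(t) == 1:
            y_t = y_hat               # gate open: accept the draft token
        else:
            y_t = verify(x, drafts)   # gate closed: fall back to the verifier
        output.append(y_t)
        drafts.append(y_hat)
    return output


def retrieval_score(grad_q: Sequence[float]) -> float:
    """C(q) = L2 norm of the loss gradient with respect to the query."""
    return math.sqrt(sum(g * g for g in grad_q))


def should_retrieve(grad_q: Sequence[float], tau: float) -> bool:
    """Assumed policy: retrieve only when C(q) exceeds a threshold tau."""
    return retrieval_score(grad_q) > tau


if __name__ == "__main__":
    # Toy stand-ins: the draft emits the current position, the verifier emits -1,
    # and the gate accepts drafts at even positions only.
    tokens = gated_decode(
        x=[],
        draft=lambda x, prefix: len(prefix),
        verify=lambda x, prefix: -1,
        gate=lambda t: 1 if t % 2 == 0 else 0,
        steps=4,
    )
    print(tokens)
    print(should_retrieve([3.0, 4.0], tau=4.0))
```

In this sketch the gradient is passed in as a plain list of floats. In practice it would come from an autodiff framework, and the gate would presumably depend on model confidence rather than position alone.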

Abstract

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, which pose a barrier for real-time applications in sectors like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: 17.6% in Exact Match Accuracy (EMA) on GSM8K and 10.5% in Multiple-Choice Accuracy (MCA) on ARC, along with reduced latency and memory on edge devices like Jetson Nano. These results show that HOLA is both scalable and production-ready.

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 6 tables, 1 algorithm.

Figures (2)

  • Figure 1: HOLA architecture.
  • Figure 2: Detailed analysis of the impact of HOLA optimization and domain transfer on large language models (LLMs) across tasks and hardware settings. (a) shows how HOLA influences model ranking for the GSM8K and ARC datasets. (b) and (c) illustrate the efficiency changes when transferring models between ARC and GSM8K tasks on various hardware. (d) compares ranking shifts caused by HOLA optimization. (e) and (f) provide t-SNE visualizations of the latent space to highlight task domain effects for GSM8K and ARC, respectively.