Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs

Liam Seymour, Basar Kutukcu, Sabur Baidya

TL;DR

This paper characterizes the feasibility of running modern LLMs on resource-constrained edge hardware by systematically evaluating NVIDIA Jetson Orin configurations with multiple Pythia models (70M–1.4B). Through a Python-based, batch-oriented testing framework, it analyzes latency, power, memory, and accuracy across hardware and software options, including 4-bit quantization and NV power models. Key findings reveal memory constraints on low-end devices, nuanced quantization effects that vary with model size, and clear trade-offs that inform device and configuration selection for constrained deployments. The work contributes a reusable benchmarking tool and a structured method to guide hardware–software co-design for on-device AI, with practical implications for privacy-sensitive and connectivity-limited applications.
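
The memory constraint and the appeal of 4-bit quantization can be illustrated with a back-of-the-envelope estimate of weight storage alone. The sketch below is not taken from the paper: the function name is hypothetical, the parameter counts are the nominal sizes of the smallest and largest models in the stated 70M–1.4B range, and the estimate ignores activations, the KV cache, framework overhead, and the scale metadata that real 4-bit schemes store, so measured footprints will be higher.

```python
def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return num_params * bits_per_param / 8


# Nominal parameter counts for the end points of the 70M-1.4B range
# (assumption: actual checkpoint sizes differ slightly from the nominal names).
NOMINAL_PARAMS = {"pythia-70m": 70e6, "pythia-1.4b": 1.4e9}

for name, params in NOMINAL_PARAMS.items():
    fp16_gb = weight_bytes(params, 16) / 1e9
    int4_gb = weight_bytes(params, 4) / 1e9
    print(f"{name}: fp16 ~ {fp16_gb:.2f} GB, 4-bit ~ {int4_gb:.3f} GB")
```

Under these assumptions, the 1.4B model needs roughly 2.8 GB for 16-bit weights and roughly 0.7 GB at 4 bits, which suggests why quantization matters most on devices with little shared memory.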

Abstract

Generative AI models such as Large Language Models (LLMs) have become more available to general consumers in recent years. Publicly available services, e.g., ChatGPT, perform token generation on networked cloud server hardware, effectively removing the hardware entry cost for end users. However, the reliance of these services on network access, the privacy and security risks involved, and sometimes the needs of the application make it necessary to run LLMs locally on edge devices. A significant amount of research has addressed the optimization of LLMs and other transformer-based models for non-networked, resource-constrained devices, but that work typically targets older hardware. Our research aims to provide a 'baseline' characterization of more recent commercially available embedded hardware for LLMs, and to provide a simple utility that facilitates batch testing of LLMs on recent Jetson hardware. We focus on the latest line of NVIDIA Jetson devices (Jetson Orin) and a set of publicly available LLMs (Pythia) ranging from 70 million to 1.4 billion parameters. Through detailed experimental evaluation with varying software and hardware parameters, we showcase trade-off spaces and optimization choices. Additionally, we design our testing structure to facilitate further research that involves batch LLM testing on Jetson hardware.
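
As a rough illustration of what batch latency testing can look like, the following minimal sketch times repeated generation calls over a list of prompts and reports the median. It is not the authors' utility: `time_batch` and `fake_generate` are hypothetical names, and the stand-in generator would be replaced by a real model call (for example, a Pythia model loaded through an inference library) in an actual experiment.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_batch(generate: Callable[[str], str],
               prompts: List[str],
               repeats: int = 3) -> Dict[str, float]:
    """Time every (prompt, repeat) call and summarize the latencies."""
    latencies = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            generate(prompt)  # the call being measured
            latencies.append(time.perf_counter() - start)
    return {
        "runs": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_generate(prompt: str) -> str:
    """Placeholder for a real model call (assumption: swap in actual generation)."""
    return prompt[::-1]


result = time_batch(fake_generate, ["Hello", "Edge devices are"], repeats=3)
print(result["runs"], "runs, median", result["median_s"], "s")
```

Reporting the median rather than the mean keeps a single slow run, such as a first call that triggers lazy initialization, from dominating the summary.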

Paper Structure

This paper contains 16 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Latency results across all device configurations, for both model loading and token generation.
  • Figure 2: Accuracy of each LLM, tested using the LM Evaluation Harness.
  • Figure 3: A comparison of the effects of quantization on median total token generation time for the Orin NX 16GB, at max NV power model.
  • Figure 4: Median peak memory allocated during both model loading and token generation, at the max NV power model. Although the Jetson device configurations do not use separate memory hardware for RAM and VRAM, the distinction within the software is shown.
  • Figure 5: Median peak power usage (in watts) during token generation.
  • ...and 2 more figures