Optimization of Armv9 architecture general large language model inference performance based on Llama.cpp

Longhao Chen, Yina Zhao, Qiangjun Xie, Qinghua Sheng

TL;DR

The work addresses efficient on-device inference of the Qwen-1.8B LLM on Armv9 hardware under resource constraints. It combines Int8 quantization via llama.cpp, NEON-based operator vectorization, and a tuned GCC build that enables more aggressive compiler optimizations. On a Yitian 710 platform with 24 decoding layers, the approach raises the prefill rate to about 145 tokens/s and the decode rate to about 48 tokens/s, and reduces memory usage to roughly 2.3 GiB, with an accuracy loss of about 0.0076 on PIQA. The results demonstrate a practical path to faster edge inference with little accuracy loss, and the code is open source for reproducibility.

Abstract

This article improves the inference performance of the Qwen-1.8B model by applying Int8 quantization, vectorizing some operators in llama.cpp, and modifying the compilation script to raise the compiler optimization level. On the Yitian 710 experimental platform, prefill performance increases 1.6 times, decoding performance increases 24 times, memory usage drops to 1/5 of the original, and the accuracy loss is almost negligible.
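To make the Int8 quantization step concrete, the following is a minimal sketch of block-wise symmetric 8-bit quantization, loosely in the spirit of llama.cpp's Q8_0 format (blocks of 32 values sharing one scale). It is an illustration under stated assumptions, not the authors' code: the names `QuantBlock`, `quantize_int8`, and `dequantize_int8` are hypothetical, the scale is stored as a 32-bit float for simplicity, and the loops are scalar rather than NEON-vectorized.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 32 values per block, each block sharing one scale factor.
constexpr std::size_t kBlockSize = 32;

// Hypothetical container for one quantized block (not a llama.cpp type).
struct QuantBlock {
    float scale;                 // dequantized value = scale * q[j]
    std::int8_t q[kBlockSize];   // quantized values in [-127, 127]
};

// Quantize a float vector block by block; a partial last block is zero-padded.
std::vector<QuantBlock> quantize_int8(const std::vector<float>& x) {
    std::vector<QuantBlock> out((x.size() + kBlockSize - 1) / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const std::size_t begin = b * kBlockSize;
        const std::size_t end = std::min(begin + kBlockSize, x.size());

        // The largest magnitude in the block determines the scale.
        float amax = 0.0f;
        for (std::size_t i = begin; i < end; ++i) {
            amax = std::max(amax, std::fabs(x[i]));
        }
        const float scale = amax / 127.0f;
        const float inv = scale > 0.0f ? 1.0f / scale : 0.0f;

        out[b].scale = scale;
        for (std::size_t j = 0; j < kBlockSize; ++j) {
            const std::size_t i = begin + j;
            out[b].q[j] = i < end
                ? static_cast<std::int8_t>(std::lround(x[i] * inv))
                : static_cast<std::int8_t>(0);
        }
    }
    return out;
}

// Reconstruct the first n floats from the quantized blocks.
std::vector<float> dequantize_int8(const std::vector<QuantBlock>& blocks,
                                   std::size_t n) {
    std::vector<float> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        const QuantBlock& blk = blocks[i / kBlockSize];
        x[i] = blk.scale * static_cast<float>(blk.q[i % kBlockSize]);
    }
    return x;
}
```

In this sketch each value costs one byte plus a small per-block overhead for the scale, and the reconstruction error per value is bounded by about half the block's scale. The per-block loops are independent, which is the kind of structure that lends itself to SIMD vectorization.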

Paper Structure (12 sections, 3 tables)