Bench360: Benchmarking Local LLM Inference from 360°
Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst
TL;DR
Bench360 is a deployment-focused benchmarking framework that combines task-specific metrics with system-level metrics to evaluate local LLM inference across models, quantizations, engines, and usage scenarios. It comprises a task engine, a workload controller, a backend abstraction, and a system metrics collector, enabling reproducible, plug-in experimentation with custom tasks on multiple engines (e.g., vLLM, SGLang, TGI, LMDeploy). The experimental evaluation examines memory-constrained deployment, scenario-focused serving, and overall system efficiency across four tasks and three GPUs, and finds trade-offs that leave no single best configuration. Bench360 gives practitioners a data-driven tool for optimizing local LLM deployments by balancing quality, latency, energy, and concurrency, and it shows that engine and quantization choices should be matched to the deployment context.
Abstract
Running large language models (LLMs) locally is becoming increasingly common. While the growing availability of small open-source models and inference engines has lowered the entry barrier, users now face an overwhelming number of configuration choices. Identifying an optimal configuration -- balancing functional and non-functional requirements -- requires substantial manual effort. Several benchmarks target LLM inference, but they are designed for narrow evaluation goals and are not user-focused. They fail to integrate relevant system and task-specific metrics into a unified, easy-to-use benchmark that supports multiple inference engines, usage scenarios, and quantization levels. To address this gap, we present Bench360 -- Benchmarking Local LLM Inference from 360°. Bench360 allows users to easily define their own custom tasks along with datasets and relevant task-specific metrics, and then automatically benchmarks selected LLMs, inference engines, and quantization levels across different usage scenarios (single stream, batch, and server). Bench360 tracks a wide range of metrics, including (1) system metrics -- such as Computing Performance (e.g., latency, throughput), Resource Usage (e.g., energy per query), and Deployment (e.g., cold start time) -- and (2) task-specific metrics such as ROUGE, F1 score, or accuracy. We demonstrate Bench360 on four common LLM tasks -- General Knowledge & Reasoning, QA, Summarization, and Text-to-SQL -- across three hardware platforms and four state-of-the-art inference engines. Our results reveal several trade-offs between task performance and system-level efficiency, highlighting the differences between inference engines and models. Most importantly, there is no single best setup for local inference, which strongly motivates the need for a framework such as Bench360.
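To make the size of the configuration space and the system metrics named above concrete, the following is a minimal Python sketch. It is not Bench360's API: the names `RunConfig`, `RunMeasurement`, `config_grid`, and `summarize` are hypothetical, the model and quantization labels are placeholders, and the measurements are assumed to have been collected elsewhere (for example, energy integrated from GPU power readings).

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class RunConfig:
    """One point in the benchmark grid (hypothetical structure)."""
    model: str
    engine: str
    quantization: str
    scenario: str


@dataclass
class RunMeasurement:
    """Raw measurements for one run, assumed to be collected elsewhere."""
    latencies_s: Sequence[float]  # end-to-end latency of each query, in seconds
    wall_time_s: float            # total wall-clock duration of the run
    generated_tokens: int         # total output tokens across all queries
    energy_joules: float          # total energy consumed during the run


def config_grid(models, engines, quantizations, scenarios) -> List[RunConfig]:
    """Enumerate every model x engine x quantization x scenario combination."""
    return [
        RunConfig(m, e, q, s)
        for m, e, q, s in product(models, engines, quantizations, scenarios)
    ]


def summarize(m: RunMeasurement) -> Dict[str, float]:
    """Derive per-run system metrics from the raw measurements."""
    n_queries = len(m.latencies_s)
    return {
        "mean_latency_s": mean(m.latencies_s),
        "throughput_tok_per_s": m.generated_tokens / m.wall_time_s,
        "energy_per_query_j": m.energy_joules / n_queries,
    }


if __name__ == "__main__":
    grid = config_grid(
        models=["model-a", "model-b"],            # placeholder model names
        engines=["vllm", "sglang", "tgi", "lmdeploy"],
        quantizations=["fp16", "int4"],           # assumed quantization labels
        scenarios=["single_stream", "batch", "server"],
    )
    print(len(grid), "configurations to benchmark")
    print(summarize(RunMeasurement([0.5, 1.5], 4.0, 200, 100.0)))
```

Even this small grid multiplies quickly across the four dimensions, which illustrates why manual exploration is costly and why no configuration can be assumed best without measuring it.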
