
Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models

Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li

TL;DR

This work tackles the challenge of deploying large language models under latency and resource constraints. It introduces Inferflow, a modular, config-driven inference engine built from composable atomic blocks to support new transformer variants without code changes. Key contributions include 3.5-bit post-training quantization, a hybrid multi-GPU partitioning strategy, dynamic batching, grouped-query attention, and multi-format loading support, with speculative decoding planned for future acceleration. The approach aims to deliver faster inference, higher throughput, and memory efficiency while remaining extensible across hardware and data formats. Inferflow is positioned as a practical, open-source solution for scalable LLM deployment.

Abstract

We present Inferflow, an efficient and highly configurable inference engine for large language models (LLMs). With Inferflow, users can serve most common transformer models by modifying a few lines in the corresponding configuration files, without writing any source code. Inferflow differs from most existing inference engines in three key respects. First, by implementing a modular framework of atomic building blocks and technologies, Inferflow is compositionally generalizable to new models. Second, Inferflow introduces 3.5-bit quantization as a tradeoff between 3-bit and 4-bit quantization. Third, Inferflow introduces hybrid model partitioning for multi-GPU inference, which balances inference speed and throughput better than the existing partition-by-layer and partition-by-tensor strategies.
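
The abstract does not say how a fractional bit width is obtained. As a rough illustration only, the sketch below shows one way an average of 3.5 bits per weight can arise: two adjacent weights share a single 7-bit code, with each weight limited to 11 levels so that the 11 × 11 = 121 combinations fit within 2^7 = 128 values. The level count, the min/max scaling, and all function names are assumptions made for this example and are not taken from Inferflow's source; per-block scale storage is also ignored.

```python
# Hypothetical sketch: pack two weights into one 7-bit code (average 3.5 bits/weight).
# LEVELS = 11 is an assumption for illustration: 11 * 11 = 121 <= 2**7 = 128.
LEVELS = 11


def quantize_pairs(weights):
    """Map an even-length list of floats to one integer code in [0, 120] per pair."""
    if len(weights) % 2 != 0:
        raise ValueError("expected an even number of weights")
    lo, hi = min(weights), max(weights)
    # Uniform step between the smallest and largest weight (1.0 if all are equal).
    scale = (hi - lo) / (LEVELS - 1) if hi > lo else 1.0
    q = [min(LEVELS - 1, max(0, round((w - lo) / scale))) for w in weights]
    # Combine two per-weight levels into a single base-11 code.
    codes = [q[i] * LEVELS + q[i + 1] for i in range(0, len(q), 2)]
    return codes, scale, lo


def dequantize_pairs(codes, scale, lo):
    """Invert quantize_pairs up to rounding error (at most scale / 2 per weight)."""
    out = []
    for code in codes:
        first, second = divmod(code, LEVELS)
        out.extend((first * scale + lo, second * scale + lo))
    return out


def bits_per_weight(num_weights):
    """Average payload bits per weight when every pair costs 7 bits."""
    return 7 * (num_weights // 2) / num_weights


weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 0.9, 1.0]
codes, scale, lo = quantize_pairs(weights)
recon = dequantize_pairs(codes, scale, lo)
```

Under these assumptions, each code fits in 7 bits and the reconstruction error per weight is bounded by half the quantization step, which places the scheme between plain 3-bit (8 levels) and 4-bit (16 levels) quantization in both size and precision.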

Paper Structure

This paper contains 19 sections, 12 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Implementation status of key technologies in Inferflow.
  • Figure 2: An illustration of two batching algorithms.