LiveMind: Low-latency Large Language Models with Simultaneous Inference

Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li

TL;DR

LiveMind is a novel low-latency inference framework that enables large language models (LLMs) to perform inference on incomplete user input, significantly enhancing the interactive experience for LLM users.

Abstract

In this paper, we introduce LiveMind, a novel low-latency inference framework for large language models (LLMs) that enables them to perform inference on incomplete user input. By reallocating computation to the input phase, the framework achieves a substantial reduction in response latency, significantly enhancing the interactive experience for LLM users. The framework manages the visibility of the streaming input to the model, allowing it either to infer from the incomplete user input or to wait for additional content. Compared with traditional inference on complete user input, our approach reduces response latency by an average of 84.0% on the MMLU dataset and 71.6% on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing a large LLM for inference and a small LLM for output, we achieve an average 37% reduction in response latency, alongside a 4.30% improvement in accuracy on the MMLU-Pro dataset compared with the baseline. The proposed LiveMind framework advances the field of human-AI interaction by enabling more responsive and efficient communication between users and AI systems.
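
The control flow described in the abstract, where the model sees a growing prefix of the input and either infers from it or waits, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `simultaneous_inference`, `toy_infer`, and `toy_final` are hypothetical names, and the two stub functions stand in for real LLM calls so the example runs on its own.

```python
from typing import Callable, List, Optional


def simultaneous_inference(
    segments: List[str],
    infer_step: Callable[[str, List[str]], Optional[str]],
    final_step: Callable[[str, List[str]], str],
) -> str:
    """Process input segments as they arrive, then answer once input is complete."""
    visible = ""            # portion of the user input revealed to the model so far
    notes: List[str] = []   # intermediate inferences produced during the input phase
    for segment in segments:
        visible += segment
        # The model may return an inference, or None to wait for more input.
        note = infer_step(visible, notes)
        if note is not None:
            notes.append(note)
    # Only this last step runs after the user has finished typing.
    return final_step(visible, notes)


# Hypothetical stand-ins for LLM calls, used purely for illustration.
def toy_infer(visible: str, notes: List[str]) -> Optional[str]:
    if "capital of France" in visible and not notes:
        return "The question concerns the capital of France."
    return None  # not enough information yet: wait


def toy_final(visible: str, notes: List[str]) -> str:
    return f"Answer based on {len(notes)} earlier inference(s): Paris"


segments = ["What is ", "the capital of France", "?"]
result = simultaneous_inference(segments, toy_infer, toy_final)
print(result)
```

In this sketch the work done inside the loop overlaps with the user's typing, so only the cost of `final_step` would count toward the response latency the user perceives. The collaborative setting from the abstract would correspond to backing `infer_step` with a large model and `final_step` with a small one.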

Paper Structure

This paper contains 16 sections, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An example of the LiveMind framework. (a) LiveMind inference with the Llama-3-70B-Instruct model; (b) LiveMind inference with the Llama-3-70B-Instruct and Llama-3-8B-Instruct models; (c) Conventional inference on complete user input.
  • Figure 2: Architecture of the LiveMind framework. The circled numbers correspond to lines of Algorithm 1.
  • Figure 3: Text segmentation used by the segmenter in the LiveMind framework. An example text segmented by (a) sentence; (b) clause; (c) word.
  • Figure 4: Five prompt formats used by the LiveMind formatter: (a) previous inferences and new prompts; (b) U-PI format; (c) U-PIL format; (d) UA-PIL format; (e) U-SPI format; (f) UA-SPI format.
  • Figure 5: Latency (speedup) and accuracy of the LiveMind framework using word and character segmenters with Llama-3-70B-Instruct as both the inference model and the output model on the MMLU dataset.
  • ...and 1 more figure
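
The three segmentation granularities named in the Figure 3 caption (sentence, clause, word) can be approximated with a short Python sketch. The splitting rules below are assumptions chosen for illustration, not the rules used by the LiveMind segmenter, and `segment` is a hypothetical helper name.

```python
import re
from typing import List


def segment(text: str, level: str) -> List[str]:
    """Split text into segments at the requested granularity (assumed rules)."""
    if level == "sentence":
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        pattern = r"(?<=[.!?])\s+"
    elif level == "clause":
        # Assumption: a clause also ends at ',', ';' or ':' followed by whitespace.
        pattern = r"(?<=[.!?,;:])\s+"
    elif level == "word":
        # Assumption: words are separated by whitespace.
        pattern = r"\s+"
    else:
        raise ValueError(f"unknown segmentation level: {level}")
    return [part for part in re.split(pattern, text.strip()) if part]


text = "If it rains, we stay in. Otherwise, we hike!"
sentences = segment(text, "sentence")
clauses = segment(text, "clause")
words = segment(text, "word")
print(sentences)
print(clauses)
print(words)
```

Finer segments give the model more opportunities to infer during the input phase, but each opportunity costs an additional model call. The caption of Figure 5 indicates that the paper evaluates this kind of latency and accuracy trade-off across segmenters.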