FluidML: Fast and Memory Efficient Inference Optimization

Jinjie Liu, Hang Qiu

TL;DR

FluidML is a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference.

Abstract

Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices have not kept pace with the ever-growing number of parameters in these models. As models become bigger and more complicated, their novel and sophisticated structures make inference runtime optimization increasingly challenging. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML consistently reduces end-to-end inference latency by up to 25.38% for popular language models and reduces peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML comprises ~30K lines of code, is built for general-purpose usage, and will be released to the community as an open-source inference runtime optimization framework.

Paper Structure

This paper contains 12 sections, 1 equation, 12 figures, 5 tables, 4 algorithms.

Figures (12)

  • Figure 1: Memory access patterns for matrix multiplication
  • Figure 2: Mockup example for the optimal memory layout challenge: graphs with simple connections (left) can be solved optimally in linear time, while complicated connections and dependencies (right) make the search intractable.
  • Figure 3: Memory Allocation Strategy. Left: the naive version. Right: the dynamic programming version.
  • Figure 4: FluidML Workflow
  • Figure 5: Normalized Latency on AMD
  • ...and 7 more figures