Speeding up Model Loading with fastsafetensors

Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman

TL;DR

This work tackles the startup latency problem in large language model loading caused by inefficient safetensors deserialization. It introduces fastsafetensors, a batched deserialization approach that transfers groups of tensors directly to GPU memory and instantiates them via DLPack, decoupling I/O from tensor object creation and enabling GPU offloading and NUMA-aware transfers. The authors demonstrate substantial improvements (roughly 4.8x–7.5x) in loading times across multiple models and configurations, with additional gains from GPUDirect Storage and tensor-sharding offload to GPUs, and they validate integration with vLLM. The findings highlight significant practical impact for inference servers, enabling faster startups and improved resource utilization, while outlining deployment trade-offs and areas for future enhancement.

Abstract

The rapid increase in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilizes storage throughput and significantly slows down the loading of large models stored in a widely used model file format, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserialization of tensors in safetensors files. Our approach first copies groups of on-disk parameters to device memory, where they are directly instantiated as tensor objects. This design enables further optimization in low-level I/O and high-level tensor preprocessing, including parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements of 4.8x to 7.5x in loading models such as Llama (7, 13, and 70 billion parameters), Falcon (40 billion parameters), and Bloom (176 billion parameters).
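
The grouped-copy idea in the abstract can be illustrated with a minimal host-memory sketch. It is not the fastsafetensors API and it involves no GPU, DMA, or DLPack: `np.frombuffer` merely stands in for the single bulk transfer of a tensor group, and the helpers `make_blob` and `load_batched` are hypothetical names introduced here. The point is only that once the whole body sits in one buffer, each tensor can be instantiated as a typed view instead of being copied on its own.

```python
import json
import struct

import numpy as np

# safetensors dtype tags mapped to little-endian NumPy dtypes (subset only).
DTYPES = {"F32": np.dtype("<f4"), "F16": np.dtype("<f2")}


def make_blob(arrays):
    """Toy writer (hypothetical helper) producing the safetensors layout."""
    header, chunks, offset = {}, [], 0
    for name, arr in arrays.items():
        raw = np.ascontiguousarray(arr, dtype="<f4").tobytes()
        header[name] = {
            "dtype": "F32",
            "shape": list(arr.shape),
            "data_offsets": [offset, offset + len(raw)],
        }
        chunks.append(raw)
        offset += len(raw)
    header_json = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_json)) + header_json + b"".join(chunks)


def load_batched(blob):
    """Take the whole body as one buffer, then carve tensors out as views."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len])
    header.pop("__metadata__", None)  # optional free-form metadata entry
    # Assumption: this stands in for one bulk transfer of a tensor group.
    body = np.frombuffer(blob, dtype=np.uint8, offset=8 + header_len)
    tensors = {}
    for name, meta in header.items():
        begin, end = meta["data_offsets"]
        dtype = DTYPES[meta["dtype"]]
        # No per-tensor copy: each tensor is a typed view into `body`.
        tensors[name] = body[begin:end].view(dtype).reshape(meta["shape"])
    return body, tensors


source = {
    "w": np.arange(6, dtype=np.float32).reshape(2, 3),
    "b": np.ones(3, dtype=np.float32),
}
body, tensors = load_batched(make_blob(source))
```

Because every entry in `tensors` shares memory with `body`, the cost of creating tensor objects no longer scales with the number of bytes read, which is the decoupling of I/O from instantiation that the abstract describes.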

Paper Structure

This paper contains 20 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Overview of the safetensors format: The file is divided into a header and a body, with its layout defined by a JSON string.
  • Figure 2: Performance of current safetensors deserializer.
  • Figure 3: Resource utilization of TGIS.
  • Figure 4: Tensor copy flow.
  • Figure 5: Tensor sharding.
  • ...and 10 more figures
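
The header/body layout named in the Figure 1 caption can be sketched with the standard library alone. The sketch assumes the publicly documented safetensors layout: an 8-byte little-endian unsigned length, a JSON header of that length, then the raw tensor bytes, with each `data_offsets` pair measured from the start of the body. `pack_safetensors` and `read_header` are hypothetical helper names, not functions from safetensors or fastsafetensors.

```python
import json
import struct


def pack_safetensors(tensors):
    """Toy writer: `tensors` maps name -> (shape, list of float32 values)."""
    header, body = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            # Offsets are relative to the start of the body, not the file.
            "data_offsets": [len(body), len(body) + len(raw)],
        }
        body += raw
    header_json = json.dumps(header).encode("utf-8")
    # 8-byte little-endian length prefix, then the JSON header, then the body.
    return struct.pack("<Q", len(header_json)) + header_json + body


def read_header(blob):
    """Return the parsed JSON header and the byte offset where the body starts."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    return header, 8 + header_len


blob = pack_safetensors({"w": ((2, 3), [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])})
header, body_start = read_header(blob)
begin, end = header["w"]["data_offsets"]
values = struct.unpack_from("<6f", blob, body_start + begin)
```

Since the header alone gives every tensor's dtype, shape, and byte range, a loader can plan large contiguous reads of the body before creating any tensor object.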