VcLLM: Video Codecs are Secretly Tensor Codecs

Ceyu Xu, Yongji Wu, Xinyu Yang, Beidi Chen, Matthew Lentz, Danyang Zhuo, Lisa Wu Wills

TL;DR

VcLLM addresses memory and bandwidth bottlenecks in LLM training and inference by repurposing video codecs as general-purpose tensor codecs. It demonstrates that HEVC/H.265 can compress weights, KV caches, activations, and gradients to fractional bit-widths on commodity GPUs using NVENC/NVDEC, enabling a $128k$ context with $4\times 8$GB devices and bitrates as low as $2.9$ bits per value for weights/KV and $3.5$ bits per value for activations. A two-stage weight compression pipeline combines RTN quantization with incoherence processing and rotation (QuaRot) to reach around $2.9$ bits without calibration, while activation and gradient compression enable substantial communication reductions and stable training. The paper also discusses accelerator design implications, proposing specialized tensor codecs with reduced die area and energy, and advocates open-sourcing VcLLM.
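As a rough illustration of what a fractional bit-width means for memory, the sketch below converts a bits-per-value rate into a storage footprint. The parameter count of 7 billion is an assumed round number chosen for illustration, not a figure from the paper, and the calculation ignores any bitstream headers or metadata a real codec would add.

```python
def tensor_bytes(num_values: float, bits_per_value: float) -> float:
    """Storage in bytes for num_values stored at an average of bits_per_value bits each."""
    return num_values * bits_per_value / 8.0


# Assumption: a model with roughly 7e9 weights (illustrative round number).
num_weights = 7e9

fp16_bytes = tensor_bytes(num_weights, 16.0)   # uncompressed 16-bit baseline
coded_bytes = tensor_bytes(num_weights, 2.9)   # the 2.9 bits-per-value rate quoted above

ratio = fp16_bytes / coded_bytes

print(f"FP16 footprint:   {fp16_bytes / 1e9:.2f} GB")
print(f"2.9-bit footprint: {coded_bytes / 1e9:.2f} GB")
print(f"Reduction factor:  {ratio:.2f}x")
```

The reduction factor depends only on the ratio of the two bit-widths, so the same arithmetic applies to KV caches or activations at their respective rates.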

Abstract

As the parameter size of large language models (LLMs) continues to expand, the need for a large memory footprint and high communication bandwidth has become a significant bottleneck for the training and inference of LLMs. To mitigate this bottleneck, various tensor compression techniques have been proposed to reduce the data size, thereby alleviating memory requirements and communication pressure. Our research found that video codecs, despite being originally designed for compressing videos, show excellent efficiency when compressing various types of tensors. We demonstrate that video codecs can be versatile and general-purpose tensor codecs while achieving state-of-the-art compression efficiency in various tasks. We further make use of the hardware video encoding and decoding modules available on GPUs to create a framework capable of both inference and training with video codecs repurposed as tensor codecs. This greatly reduces the requirement for memory capacity and communication bandwidth, enabling training and inference of large models on consumer-grade GPUs.

Paper Structure (22 sections, 1 equation, 10 figures, 2 tables)

Figures (10)

  • Figure 1: VcLLM: General-Purpose and Versatile Tensor Compression for LLM Training and Inference.
  • Figure 2: Why does the Video Codec Work for LLM? (a) illustrates the pipeline of the H.265 video encoder. In (b), we incrementally activate the stages in the H.265 video encoding pipeline to demonstrate how each step contributes to the compression process. We constrain the quality of the compression/decompression process to a mean squared error of at most 0.01.
  • Figure 3: Transform coding mitigates outliers by spreading them across all values within the block. The transition from (a) to (b) demonstrates how the DCT removes outliers from a normally distributed matrix containing outliers. (c) to (d) shows a concrete example of how an outlier with a value of 128 is "smoothed" into other values within the block.
  • Figure 4: An example of a block of LLaMA-2-7B weights going through the H.265 pipeline. The intra-prediction step generates a rough prediction of the entire block, making the residuals easy to code with the DCT.
  • Figure 5: The trade-off between accuracy and average bit-width of different methods for compressing the LLaMA-2-7B model on eight commonsense reasoning tasks.
  • ...and 5 more figures
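
The outlier-smoothing effect described in the Figure 3 caption can be reproduced with a small sketch. It applies an orthonormal 2D DCT-II to a block that is zero everywhere except for a single entry of 128. The 8x8 block size, the outlier position, and the floating-point DCT are assumptions made for illustration; H.265 itself uses integer approximations of the transform and variable block sizes.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(block):
    """2D DCT of a square block: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


N = 8  # assumed block size for this illustration
block = [[0.0] * N for _ in range(N)]
block[3][5] = 128.0  # a single outlier, echoing the value in the caption

coeffs = dct2(block)

max_in = max(abs(v) for row in block for v in row)
max_out = max(abs(v) for row in coeffs for v in row)
energy_in = sum(v * v for row in block for v in row)
energy_out = sum(v * v for row in coeffs for v in row)

print("largest magnitude before DCT:", max_in)
print("largest magnitude after DCT: ", round(max_out, 3))
print("energy before / after:", energy_in, round(energy_out, 6))
```

Because the transform is orthonormal, the total energy is unchanged, but no single coefficient is as large as the original outlier, so the values to be quantized occupy a much narrower range.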