
OneComp: One-Line Revolution for Generative AI Model Compression

Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, Yudai Fujimoto, Hiroki Tokura, Yamato Arai, Yoshiyuki Ishii, Yusei Kawakami, Genki Shikada, Achille Jacquemond, Yoshihiko Fujisawa, Katsuki Fujisawa, Takumi Honda, Akira Sakai

Abstract

Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging, as practitioners must navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and the available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages that advance from layer-wise compression through block-wise refinement to global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
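
The abstract describes a workflow that takes a model identifier and a hardware description and returns a deployment-ready quantized checkpoint. The sketch below is a hypothetical illustration of what such a one-line entry point could look like, not the actual OneComp API: the module name `onecomp`, the function `compress`, and all parameter names are assumptions made for exposition.

```python
# Hypothetical usage sketch; NOT the real OneComp API. The module,
# function, and parameter names below are illustrative assumptions
# based on the workflow described in the abstract.
from onecomp import compress  # assumed entry point

# One call: inspect the model, plan mixed-precision bit assignments for
# the available hardware, then run the progressive stages (layer-wise
# compression -> block-wise refinement -> global refinement).
quantized_model = compress(
    model_id="meta-llama/Meta-Llama-3-8B",  # model identifier (illustrative)
    hardware="1x A100 40GB",                # available hardware (illustrative)
    avg_bits=4.16,                          # target average bits per weight (illustrative)
)
quantized_model.save_pretrained("llama3-8b-w4g128")
```

Under the pivot-based design described in the abstract, the checkpoint produced by the first stage would already be deployable, and rerunning with a larger compute budget would refine that same checkpoint rather than start over.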



Figures (6)

  • Figure 1: Overview of the OneComp pipeline. Top: end-to-end workflow from a pretrained foundation model to a deployment-ready quantized model. Bottom: the three quantization granularity levels on a Transformer with mixed-precision bit allocation. Cell colors indicate per-layer bit-widths assigned by AutoBit; darker colors denote lower precision. Layer-wise PTQ operates on one linear layer; block-wise PTQ on one Transformer block; global PTQ on the full model.
  • Figure 2: Comparison of 4-bit and 2-bit uniform quantization on the same weight distribution. Each panel overlays the quantization grid on a histogram of weights. 4-bit uniform quantization provides 16 fine-grained levels; 2-bit uniform quantization provides only 4, yielding coarser approximation and larger errors.
  • Figure 3: Three levels of quantization granularity for a weight matrix. Each color represents a group of elements sharing the same scale and zero-point. Per-tensor uses one set of parameters for the entire matrix; per-channel uses one per row; per-group divides each row into smaller groups.
  • Figure 4: Two quantization formats supported by OneComp. Left: GPTQ stores per-group integer weights $\bm{Q}$, scales $\bm{s}$, and zero-points $\bm{z}$, and reconstructs weights via $\hat{W}_{ij} = s_g(q_{ij} - z_g)$; a minimal sketch of this per-group reconstruction follows the list. Right: MDBF represents the weight matrix using shared binary sign bases $S_a, S_b$ together with low-rank real-valued envelopes $AP^\top$ and $BG^\top$, yielding $\hat{W} = (S_a \odot AP^\top)(S_b \odot BG^\top)^\top$.
  • Figure 5: Example module-wise bit allocation on Llama 3 under a 4.16 average bits-per-weight (bpw) budget, which is equivalent to 4-bit quantization with a group size of 128. Top: naive mixed-precision allocation. Bottom: activation-aware allocation using diagonal activation statistics only. The activation-aware variant tends to reserve higher precision for modules with stronger activation outliers or higher activation sensitivity.
  • ...and 1 more figure
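
To make the quantization notions in the captions above concrete, the NumPy sketch below applies per-group uniform quantization (Figure 3, per-group granularity) and reconstructs weights via $\hat{W}_{ij} = s_g(q_{ij} - z_g)$ as in Figure 4 (left), then contrasts the 16-level 4-bit grid with the 4-level 2-bit grid from Figure 2. This is an illustration using standard asymmetric min-max quantization, not OneComp's implementation.

```python
import numpy as np

def quantize_per_group(W, bits, group_size=128):
    """Per-group asymmetric uniform quantization and reconstruction.

    Each group of `group_size` consecutive weights within a row shares one
    scale s_g and zero-point z_g (Figure 3, per-group). Reconstruction
    follows the GPTQ-style formula in Figure 4: W_hat = s_g * (q - z_g).
    """
    rows, cols = W.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    G = W.reshape(rows, cols // group_size, group_size)
    lo = G.min(axis=-1, keepdims=True)
    hi = G.max(axis=-1, keepdims=True)
    qmax = 2**bits - 1                         # codes 0..qmax: 16 levels at 4 bits, 4 at 2 bits
    s = (hi - lo) / qmax                       # per-group scale
    z = np.round(-lo / s)                      # per-group zero-point
    q = np.clip(np.round(G / s + z), 0, qmax)  # integer codes
    return (s * (q - z)).reshape(rows, cols)   # dequantized weights

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)
for bits in (4, 2):
    err = np.abs(W - quantize_per_group(W, bits)).mean()
    print(f"{bits}-bit ({2**bits} levels): mean |W - W_hat| = {err:.4f}")
```

Running this shows the 2-bit grid's reconstruction error is several times larger than the 4-bit grid's, which is the contrast Figure 2 visualizes. The 4.16 bpw budget in Figure 5 is also consistent with this storage layout if each group of 128 weights carries a 16-bit scale and a 4-bit zero-point on top of its 4-bit codes, i.e. $4 + (16 + 4)/128 \approx 4.16$ bits per weight; the exact overhead breakdown is an assumption here, since the caption does not specify it.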