OJBKQ: Objective-Joint Babai-Klein Quantization

Xinyu Wang, Ziyu Zhao, Peng Lu, Yu Gu, Xiao-Wen Chang

TL;DR

OJBKQ reframes layer-wise post-training quantization (PTQ) for large language models as a joint optimization over activations and weights, resulting in a multiple-right-hand-side box-constrained integer least squares (BILS) problem per layer. It introduces Joint Target Alignment (JTA) to smoothly interpolate between runtime and full-precision targets, and uses Random-$K$ Klein decoding to generate multiple suboptimal candidates, selecting the best by a unified JTA score. The method achieves lower perplexity at 3-4 bits and maintains stability across model families (Llama3, Qwen3, Mistral) with competitive compute cost, aided by GPU-efficient, path-isolated implementations. Across perplexity, zero-shot accuracy, and reasoning benchmarks, OJBKQ consistently outperforms strong PTQ baselines, particularly under aggressive low-bit settings, demonstrating the value of a principled lattice-decoding formulation for end-to-end PTQ robustness.

Abstract

Post-training quantization (PTQ) is widely used to compress large language models without retraining. However, many existing weight-only methods rely on heuristic objectives and greedy rounding, which leads to noticeable degradation under low-bit quantization. In this work, we introduce OJBKQ (Objective-Joint Babai-Klein Quantization with K-Best Sampling), a layer-wise PTQ method that formulates weight quantization as a joint optimization problem over activations and weights. This formulation results in a multiple-right-hand-side box-constrained integer least squares (BILS) problem in each layer, which is NP-hard. For each column of the weight matrix, we apply an extended Babai nearest-plane algorithm and an extended version of Klein's randomized Babai algorithm to find the minimum-residual Babai-Klein point, a suboptimal solution to the BILS problem. Experimental results on large language models show that OJBKQ achieves lower perplexity at 3-4 bits than existing PTQ approaches, while maintaining comparable computational cost.
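
To make the per-column decoding step concrete, the following is a minimal sketch of box-constrained Babai nearest-plane rounding combined with Klein-style randomized rounding for a single right-hand side. It is an illustration under stated assumptions, not the paper's extended algorithms: the function names (`babai_box`, `klein_box`, `best_of_k`), the temperature parameter `alpha`, and the use of the plain residual norm as the selection score (in place of the JTA score) are all hypothetical choices made here. The matrix `A` is assumed to have full column rank.

```python
import numpy as np

def babai_box(R, ybar, lo, hi):
    """Nearest-plane rounding with each coordinate clipped to the box [lo, hi]."""
    n = R.shape[0]
    z = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Unconstrained real-valued estimate for coordinate i given later coordinates.
        c = (ybar[i] - R[i, i + 1:] @ z[i + 1:]) / R[i, i]
        z[i] = np.clip(np.rint(c), lo, hi)
    return z

def klein_box(R, ybar, lo, hi, alpha, rng):
    """Klein-style randomized rounding: sample each coordinate from a discrete
    Gaussian restricted to the integers in [lo, hi] (alpha is an assumed knob)."""
    n = R.shape[0]
    z = np.zeros(n)
    grid = np.arange(lo, hi + 1)
    for i in range(n - 1, -1, -1):
        c = (ybar[i] - R[i, i + 1:] @ z[i + 1:]) / R[i, i]
        logp = -alpha * (R[i, i] ** 2) * (grid - c) ** 2
        p = np.exp(logp - logp.max())  # subtract the max for numerical stability
        z[i] = rng.choice(grid, p=p / p.sum())
    return z

def best_of_k(A, y, lo, hi, K=8, alpha=1.0, seed=0):
    """Return the minimum-residual point among one Babai point and K Klein samples."""
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(A)  # reduced QR: A = Q R with R upper triangular
    ybar = Q.T @ y
    cands = [babai_box(R, ybar, lo, hi)]
    cands += [klein_box(R, ybar, lo, hi, alpha, rng) for _ in range(K)]
    res = [float(np.linalg.norm(y - A @ z)) for z in cands]
    best = int(np.argmin(res))
    return cands[best], res[best]
```

Because the deterministic Babai point is always among the candidates, the selected point can never have a larger residual than Babai rounding alone; the random samples can only improve on it. For a 4-bit unsigned grid one would call, for example, `best_of_k(A, y, 0, 15, K=8)`.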

Paper Structure (25 sections, 21 equations, 4 figures, 4 tables, 4 algorithms)

Figures (4)

  • Figure 1: Layer-wise comparison of the original output norms and JTA reconstruction errors across Layers 1, 15, and 30. We present results for all linear modules under varying $K$ settings.
  • Figure 2: Ablation study on the candidate size $K$. We evaluate the perplexity on C4 and WikiText-2 datasets using Llama-3-8B with 4-bit quantization (group size 128). The results demonstrate the impact of increasing the search space $K$.
  • Figure 3: Sensitivity analysis of hyperparameters $\mu$ and $\lambda$. Evaluated on WikiText-2 (calibrated on C4), the plots show the perplexity trend when varying one parameter while fixing the other at 0.6. The U-shaped curve for $\mu$ (left) confirms the necessity of balancing the two objectives, while $\lambda$ (right) shows 0.6 as a robust operating point.
  • Figure 4: Layer time increase ratio for different $K$ on Llama-3-8B with 4-bit quantization.