PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, Yulun Zhang

TL;DR

PT$^2$-LLM tackles the challenge of post-training ternarization for large language models by introducing an Asymmetric Ternary Quantizer (ATQ) refined through Iterative Ternary Fitting (ITF) and Activation-aware Grid Alignment (AGA), along with a Structural Similarity-based Reordering (SSR) to mitigate outliers. The training-free framework achieves competitive zero-shot QA accuracy and perplexity at a drastically reduced memory footprint (around 1.58–1.59-bit equivalents for various backbones), while delivering substantial end-to-end speedups in prefill and decoding. Key innovations include a closed-form, row-wise grid optimization for $(\alpha, \mu)$, flexible ternary rounding for $\mathbf{T}$, activation-aware output alignment, and block-wise, structure-aware column reordering that improves quantization stability. Empirical results on LLaMA, LLaMA-2, LLaMA-3, and Qwen3-base show PT$^2$-LLM outperforms many 2-bit PTQ baselines and reduces model size significantly, making sub-2-bit ternarization practical for real-world LLM deployment.
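
The "1.58–1.59-bit" figure can be sanity-checked with a little arithmetic: a ternary symbol carries $\log_2 3 \approx 1.585$ bits, and any per-row grid parameters add a small overhead on top. The sketch below is illustrative only. The storage layout (two 16-bit parameters per row, standing in for $\alpha$ and $\mu$) and the helper names are assumptions, not details taken from the paper.

```python
import math


def ternary_bits_per_weight() -> float:
    """Information content of one ternary symbol in {-1, 0, +1}."""
    return math.log2(3)


def effective_bits(cols: int, param_bits: int = 16, params_per_row: int = 2) -> float:
    """Average bits per weight for one row of `cols` ternary weights.

    Assumption (hypothetical layout): each row stores `params_per_row`
    grid parameters (e.g. alpha and mu) at `param_bits` bits each.
    """
    return ternary_bits_per_weight() + params_per_row * param_bits / cols


if __name__ == "__main__":
    print(f"ideal ternary code: {ternary_bits_per_weight():.3f} bits/weight")
    for cols in (4096, 8192):
        print(f"row width {cols}: {effective_bits(cols):.3f} bits/weight")
```

Under these assumptions the overhead shrinks as rows get wider, which is one plausible reason the reported bit-width varies slightly across backbones.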

Abstract

Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. Ternarization has gained attention as a promising compression technique, delivering substantial size reduction and high computational efficiency. However, its potential in the post-training quantization (PTQ) setting remains underexplored, due to the challenge of training-free parameter optimization and the quantization difficulty posed by outliers and dispersed weights. To address these issues, we propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline: (1) Iterative Ternary Fitting (ITF), which alternates between optimal ternary grid construction and flexible rounding to minimize quantization error, and (2) Activation-aware Grid Alignment (AGA), which further refines the ternary grid to better match full-precision outputs. In addition, we propose a plug-and-play Structural Similarity-based Reordering (SSR) strategy that leverages inter-column structural similarity to ease quantization and mitigate outlier effects, further enhancing overall performance. Extensive experiments demonstrate that PT$^2$-LLM delivers competitive performance against state-of-the-art (SOTA) 2-bit PTQ methods with lower memory cost, while also accelerating both prefill and decoding to achieve end-to-end speedup. The code and models will be available at https://github.com/XIANGLONGYAN/PT2-LLM.
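
The alternating structure of ITF described above can be illustrated with a minimal NumPy sketch. It assumes the asymmetric ternary model $\mathbf{W} \approx \alpha \mathbf{T} + \mu$ with row-wise $(\alpha, \mu)$ and $\mathbf{T} \in \{-1, 0, +1\}$, uses an ordinary least-squares fit as the grid step, and uses nearest-level rounding as the rounding step. The function names, the initialization, and the exact update rules are assumptions made for illustration; the paper's closed form and its flexible rounding may differ, and AGA (the activation-aware stage) is not modeled here.

```python
import numpy as np


def fit_grid(W, T, alpha_prev):
    """Row-wise least-squares fit of W ~ alpha * T + mu for a fixed ternary T."""
    n = W.shape[1]
    st = T.sum(axis=1, keepdims=True)
    sw = W.sum(axis=1, keepdims=True)
    swt = (W * T).sum(axis=1, keepdims=True)
    stt = (T * T).sum(axis=1, keepdims=True)
    denom = n * stt - st ** 2          # zero only when a row of T is constant
    ok = denom > 1e-12
    alpha = np.where(ok, (n * swt - sw * st) / np.where(ok, denom, 1.0), alpha_prev)
    mu = (sw - alpha * st) / n         # optimal offset for the chosen alpha
    return alpha, mu


def round_ternary(W, alpha, mu):
    """Assign each weight to the nearest grid level among {-1, 0, +1}."""
    return np.clip(np.rint((W - mu) / np.maximum(alpha, 1e-12)), -1, 1)


def iterative_ternary_fit(W, steps=10):
    """Alternate grid fitting and rounding; returns parameters and error trace."""
    # Assumed initialization: row mean as offset, mean absolute deviation as scale.
    mu = W.mean(axis=1, keepdims=True)
    alpha = np.abs(W - mu).mean(axis=1, keepdims=True)
    T = round_ternary(W, alpha, mu)
    errors = []
    for _ in range(steps):
        alpha, mu = fit_grid(W, T, alpha)   # grid step (T fixed)
        T = round_ternary(W, alpha, mu)     # rounding step (grid fixed)
        errors.append(float(np.sum((W - (alpha * T + mu)) ** 2)))
    return alpha, mu, T, errors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 64))
    alpha, mu, T, errors = iterative_ternary_fit(W)
    print("weight error per step:", [round(e, 4) for e in errors])
```

Because each step is optimal with the other variable held fixed, the squared weight error in this sketch cannot increase from one iteration to the next, which mirrors the kind of decreasing error curve the alternation is meant to produce.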

Paper Structure

This paper contains 17 sections, 16 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: LLaMA performance on 7 zero-shot Question Answering (QA) datasets. PT$^2$-LLM yields the best accuracy at equal memory cost.
  • Figure 2: Overview of PT$^2$-LLM. Structural Similarity-based Reordering (SSR): reorders columns based on structural similarity. Asymmetric Ternary Quantizer: enhanced by Iterative Ternary Fitting (ITF) and Activation-aware Grid Alignment (AGA) for refined ternary parameter optimization.
  • Figure 3: Visualization of the proposed Asymmetric Ternary Quantizer (ATQ) and Structural Similarity-based Reordering (SSR) effects. Left: Quantization error $\mathcal{E}_w$ across optimization steps during ATQ. Middle: Output error $\mathcal{E}_x$ across optimization steps during ATQ. Right: After column reordering, the block-wise variance becomes smaller, showing a more compact weight distribution.
  • Figure 4: Throughput comparison between ternary (1.58-bit) and 2-bit quantized models across LLaMA 7B–65B. We evaluate performance on prefill, decode, and end-to-end generation stages.
  • Figure : Pseudocode of the Asymmetric Ternary Quantizer. See the supplementary file for details.