Robust Residual Finite Scalar Quantization for Neural Compression

Xiaoxu Zhu, Jiakui Li, Ken Zheng, Guiping Zhong, Huimeng Wang, Shiyin Kang, Dahua Lin

TL;DR

Robust Residual Finite Scalar Quantization (RFSQ) tackles the residual magnitude decay inherent in multi-stage FSQ by introducing two conditioning strategies: learnable scaling factors and invertible LayerNorm. The approach preserves FSQ’s simplicity while enabling effective progressive refinement across stages. On audio at 24 bits/frame, RFSQ-LayerNorm reaches a DNSMOS of $3.646$ (vs. $3.518$ for RVQ); on ImageNet at 40 bits, it reaches $L1=0.102$ and $LPIPS=0.100$. Across both modalities, LayerNorm-based conditioning consistently outperforms the unconditioned and alternative variants, which supports its role in stabilizing inter-stage magnitudes and preserving information. The findings suggest RFSQ is a practical, generalizable plug-and-play framework for neural compression, with potential extensions to adaptive stage allocation and video compression.

Abstract

Finite Scalar Quantization (FSQ) offers simplified training but suffers from residual magnitude decay in multi-stage settings, where subsequent stages receive exponentially weaker signals. We propose Robust Residual Finite Scalar Quantization (RFSQ), addressing this fundamental limitation through two novel conditioning strategies: learnable scaling factors and invertible layer normalization. Our experiments across audio and image modalities demonstrate RFSQ's effectiveness and generalizability. In audio reconstruction at 24 bits/frame, RFSQ-LayerNorm achieves 3.646 DNSMOS, a 3.6% improvement over state-of-the-art RVQ (3.518). On ImageNet, RFSQ achieves 0.102 L1 loss and 0.100 perceptual loss, with LayerNorm providing 9.7% L1 improvement and 17.4% perceptual improvement over unconditioned variants. The LayerNorm strategy consistently outperforms alternatives by maintaining normalized input statistics across stages, effectively preventing exponential magnitude decay that limits naive residual approaches. RFSQ combines FSQ's simplicity with multi-stage quantization's representational power, establishing a new standard for neural compression across diverse modalities.
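
The decay described above, and the effect of renormalizing each stage's input, can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's implementation: the `fsq` helper is a simplified quantizer (tanh bounding, then rounding to an odd number of levels shared by all dimensions), there is no encoder, decoder, or training, and the names `fsq`, `residual_fsq`, and `conditioning` are hypothetical. In the `"layernorm"` branch the per-vector mean and standard deviation are kept so the normalization can be inverted after quantization; how those statistics are handled in the actual method is not specified in this summary.

```python
import numpy as np

def fsq(z, levels=5):
    """Simplified FSQ (assumption): tanh-bound each scalar, round to `levels` points in [-1, 1]."""
    half = (levels - 1) / 2  # assumes an odd number of levels
    return np.round(np.tanh(z) * half) / half

def residual_fsq(x, num_stages=4, conditioning="none", eps=1e-5):
    """Quantize x in stages; each stage quantizes what the previous stages missed."""
    residual = x.copy()
    recon = np.zeros_like(x)
    stage_rms = []  # RMS of the raw residual entering each stage
    for _ in range(num_stages):
        stage_rms.append(float(np.sqrt(np.mean(residual ** 2))))
        if conditioning == "layernorm":
            # Normalize over the feature axis, quantize, then invert the normalization.
            mu = residual.mean(axis=-1, keepdims=True)
            sigma = residual.std(axis=-1, keepdims=True) + eps
            q = fsq((residual - mu) / sigma) * sigma + mu
        else:
            # Unconditioned: the fixed grid sees an ever smaller input.
            q = fsq(residual)
        recon += q
        residual = residual - q
    return recon, stage_rms

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))  # stand-in for encoder latents

recon_none, rms_none = residual_fsq(x, conditioning="none")
recon_ln, rms_ln = residual_fsq(x, conditioning="layernorm")

err_none = float(np.mean((x - recon_none) ** 2))
err_ln = float(np.mean((x - recon_ln) ** 2))
print("residual RMS per stage (none):     ", rms_none)
print("residual RMS per stage (layernorm):", rms_ln)
print("final MSE none vs layernorm:", err_none, err_ln)
```

In the unconditioned run, later stages receive residuals that are small relative to the fixed quantization grid, so most values round to zero and the reconstruction stops improving. With the invertible normalization, every stage quantizes a unit-variance input, so each stage keeps refining the reconstruction.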

Paper Structure

This paper contains 11 sections, 6 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: DNSMOS evaluation results. Box plots show score distributions, with RFSQ variants (blue) consistently outperforming traditional baselines. Red dashed line indicates original audio quality (3.810).
  • Figure 2: Visual quality comparison. From top: original, RFSQ-2×2048-LN (22.0 bits), RFSQ-4×1024-LN (40.0 bits), RFSQ-4×1024-None (40.0 bits).
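
The bit counts in the Figure 2 caption are consistent with reading a configuration name such as RFSQ-4×1024 as "number of stages × codebook size per stage", with each stage contributing $\log_2$ of its codebook size. That reading is an assumption inferred from the caption, not something this summary states; the helper name `bits_per_code` is hypothetical.

```python
import math

def bits_per_code(num_stages, codebook_size):
    """Bits needed to index one code per stage (assumed naming: stages x codebook size)."""
    return num_stages * math.log2(codebook_size)

# Configurations from the Figure 2 caption: 2 stages of 2048 and 4 stages of 1024.
print(bits_per_code(2, 2048))  # 22.0
print(bits_per_code(4, 1024))  # 40.0
```

Under this reading, the LN and None variants at 4×1024 share the same 40-bit budget, so the visual difference between them in Figure 2 would come from the conditioning strategy alone.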