Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction

David Alejandro Trejo Pizzo

TL;DR

The paper tackles the memory bottleneck in edge deployment of large language models by combining extreme 1.58-bit ternary quantization with a gated low-rank FP16 correction path. Hybrid Gated Flow (HGF) pairs this gated, LoRA-like correction with a differential attention mechanism to stabilize training and recover quality that pure ternary quantization loses. On TinyStories, HGF reaches a validation loss of about 0.93, recovering roughly 55% of the quantization gap with only ~12-15% additional memory, and it trains stably where full-precision differential attention becomes unstable. The authors give a theoretical justification for the observed stabilization, report preliminary scaling results on 1.2B and 3B parameter models with custom kernels, and discuss edge and cloud deployment implications, positioning HGF as a practical approach for memory-efficient yet capable LLMs on resource-constrained devices.
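To make the dual-stream idea concrete, here is a minimal pure-Python sketch of a single hybrid layer: a ternary backbone plus a gated low-rank correction. It is an illustration under stated assumptions, not the paper's implementation. The absmean scaling rule, the scalar gate, the zero-initialized up-projection, and every name (`ternarize`, `hgf_linear`, `down`, `up`, `gate`) are assumptions chosen for the example.

```python
# Sketch of y = scale * (Q @ x) + gate * (up @ (down @ x)).
# Assumptions (not from the paper): absmean scaling, scalar gate,
# LoRA-style zero-initialized up-projection.

def ternarize(weights, eps=1e-8):
    """Map a weight matrix to {-1, 0, +1} plus one shared scale."""
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes) + eps
    ternary = [[max(-1, min(1, round(w / scale))) for w in row]
               for row in weights]
    return ternary, scale

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def hgf_linear(x, ternary, scale, down, up, gate):
    """Ternary backbone output plus a gated low-rank correction."""
    backbone = [scale * v for v in matvec(ternary, x)]
    correction = matvec(up, matvec(down, x))  # rank = len(down)
    return [b + gate * c for b, c in zip(backbone, correction)]

weights = [[0.9, -0.1, 0.4],
           [-0.7, 0.05, 0.6]]
ternary, scale = ternarize(weights)

x = [1.0, 2.0, -1.0]
down = [[0.2, -0.3, 0.1]]   # rank-1 down-projection, shape (1, 3)
up = [[0.0], [0.0]]         # zero-initialized up-projection, shape (2, 1)

y = hgf_linear(x, ternary, scale, down, up, gate=0.1)
print(ternary)  # every entry is -1, 0, or +1
print(y)        # equals the pure ternary output while `up` is all zeros
```

With `up` at zero the layer starts out as a pure ternary layer, so the correction path can only add to the backbone as training moves `up` and the gate away from zero. In this sketch the only full-precision storage beyond the ternary matrix is the two small factors and the gate.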

Abstract

The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" -- a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone. Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale linearly to production-grade language modeling regimes.
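The "approximately 55%" figure can be checked directly from the three validation losses quoted in the abstract. The short sketch below does the arithmetic; the function name `gap_recovered` is chosen here for illustration.

```python
def gap_recovered(fp16_loss, ternary_loss, hybrid_loss):
    """Fraction of the ternary-to-FP16 loss gap closed by the hybrid model."""
    gap = ternary_loss - fp16_loss
    return (ternary_loss - hybrid_loss) / gap

# Validation losses as reported in the abstract.
recovered = gap_recovered(fp16_loss=0.8490,
                          ternary_loss=1.0294,
                          hybrid_loss=0.9306)
print(f"{recovered:.1%}")  # 54.8%
```

The full gap is 1.0294 - 0.8490 = 0.1804, of which HGF closes 1.0294 - 0.9306 = 0.0988, which rounds to the stated 55%.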

Paper Structure (46 sections, 10 theorems, 35 equations, 2 figures, 6 tables, 1 algorithm)

Introduction
Contributions
Scope and Limitations
Design Rationale: The "Best-of-Breed" Synthesis
Methodology: The HGF Architecture
Preliminaries and Notation
Ternary Weight Quantization
Activation Quantization
Computational Complexity Analysis
The Gated Low-Rank Correction Mechanism
Gate Mechanism
Initialization and Training Stability
Differential Attention with Hybrid Projections
Differential Attention Mechanism
Training Protocol
...and 31 more sections

Key Result

Proposition 2.1

The gradient $\nabla_W Q(W)$ is zero almost everywhere and undefined at the decision boundaries. Formally:
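The proposition can be checked numerically. The sketch below is not the paper's formal statement. It assumes a round-and-clip ternary quantizer with a fixed scale (the names `quantize` and `finite_difference` and the scale value are illustrative) and estimates the derivative by central differences.

```python
def quantize(w, scale):
    """Round-and-clip ternary quantizer with a fixed scale (assumed form)."""
    return max(-1, min(1, round(w / scale)))

def finite_difference(w, scale, eps=1e-4):
    """Central-difference estimate of dQ/dw at w."""
    return (quantize(w + eps, scale) - quantize(w - eps, scale)) / (2 * eps)

scale = 0.5  # decision boundaries then sit at w = -0.25 and w = +0.25

# Away from the boundaries the quantizer is piecewise constant,
# so the estimated derivative is exactly zero.
interior_points = [-0.6, -0.1, 0.1, 0.6]
interior_grads = [finite_difference(w, scale) for w in interior_points]
print(interior_grads)

# At a boundary the output jumps by 1, so the estimate is about 1 / (2 * eps)
# and grows without bound as eps shrinks: the derivative is undefined there.
coarse = finite_difference(0.25, scale, eps=1e-4)
fine = finite_difference(0.25, scale, eps=1e-6)
print(coarse, fine)
```

Because the true derivative carries no learning signal, training through such a quantizer relies on a surrogate. A straight-through estimator, as commonly defined, simply passes the upstream gradient through the quantizer unchanged in the backward pass.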

Figures (2)

Figure 1: Comparative Architectural Topology. (a) Standard FP16 layers achieve high quality but consume significant memory. (b) BitNet b1.58 dramatically reduces memory but loses fine-grained expressiveness. (c) HGF combines the structural efficiency of ternary quantization with a gated LoRA correction pathway.
Figure 2: Gate Evolution During Training. The gate value increases during warmup as the model discovers useful corrections, stabilizes during regularization, and remains constant after freezing at step 900. Final value: $g \approx 0.1023$.

Theorems & Definitions (34)

Definition 2.1: Input Tensor
Definition 2.2: Absmax Quantization
Proposition 2.1: Gradient Discontinuity
Definition 2.3: Straight-Through Estimator
Remark 2.1
Definition 2.4: Dynamic Activation Quantization
Theorem 2.1: Complexity Reduction
proof
Definition 2.5: Quantization Error
Definition 2.6: Low-Rank Correction
...and 24 more
