Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
David Alejandro Trejo Pizzo
TL;DR
The paper tackles the memory bottleneck in edge deployment of large language models by pairing extreme 1.58-bit ternary quantization with a gated, low-rank FP16 correction path. Hybrid Gated Flow (HGF) combines this gated, LoRA-like correction with a differential attention mechanism to stabilize training and recover quality that pure ternary quantization loses. On TinyStories, HGF reaches a validation loss of about 0.93, recovering roughly 55% of the quantization gap with only ~12-15% additional memory, and it trains stably where full-precision differential attention diverges. The authors provide a theoretical justification for the observed stabilization, present scaling evidence on larger models with custom kernels, and discuss implications for edge and cloud deployment, positioning HGF as a practical approach to memory-efficient yet capable LLMs on resource-constrained devices.
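To make the dual-path idea concrete, here is a minimal sketch of a single linear layer that sums a ternary backbone with a gated low-rank correction. It is not the paper's implementation. The function names, the scalar sigmoid gate, and the use of plain Python floats in place of FP16 tensors are all assumptions made for illustration. The absmean rounding follows the BitNet b1.58 style of ternary quantization.

```python
import math


def ternary_quantize(W):
    """Absmean ternary quantization (BitNet b1.58 style).

    Returns (Wq, scale), where every entry of Wq is in {-1, 0, +1}.
    """
    flat = [abs(w) for row in W for w in row]
    scale = sum(flat) / len(flat) + 1e-8  # mean absolute weight
    Wq = [[max(-1, min(1, round(w / scale))) for w in row] for row in W]
    return Wq, scale


def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def hgf_linear(x, W, A, B, gate_logit):
    """Hypothetical HGF-style layer: y = scale * Wq x + g * B (A x).

    A has shape (r, d_in) and B has shape (d_out, r), so B @ A is a
    rank-r correction. The scalar gate g = sigmoid(gate_logit) is an
    assumed form; the paper only states that the gates are adaptive.
    """
    Wq, scale = ternary_quantize(W)
    backbone = [scale * v for v in matvec(Wq, x)]
    correction = matvec(B, matvec(A, x))
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [b + g * c for b, c in zip(backbone, correction)]
```

With a strongly negative `gate_logit` the layer reduces to the pure ternary backbone, which is one way to see how a gate can control how much of the higher-precision correction is mixed in.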
Abstract
The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall": a hardware limitation in which memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone. Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability, with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale linearly to production-grade language modeling regimes.
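The headline figures can be checked directly from the three validation losses quoted above. The short calculation below is only a sanity check. The variable names are illustrative, and the perplexity line assumes the reported loss is a mean cross-entropy in nats per token, which the abstract does not state.

```python
import math

# Validation losses as reported in the abstract.
fp16_loss = 0.8490    # FP16 baseline
bitnet_loss = 1.0294  # pure ternary (BitNet b1.58)
hgf_loss = 0.9306     # HGF 5.4

# Fraction of the quantization gap that HGF closes.
gap = bitnet_loss - fp16_loss       # 0.1804
recovered = bitnet_loss - hgf_loss  # 0.0988
recovery_fraction = recovered / gap

# Relative perplexity increase of BitNet over FP16,
# ASSUMING the loss is cross-entropy in nats per token.
ppl_increase = math.exp(bitnet_loss - fp16_loss) - 1

# Information content of one ternary weight in {-1, 0, +1}.
bits_per_ternary = math.log2(3)

print(f"gap recovered:       {recovery_fraction:.1%}")
print(f"perplexity increase: {ppl_increase:.1%}")
print(f"bits per weight:     {bits_per_ternary:.3f}")
```

The recovered fraction works out to about 0.548, consistent with the stated 55%. Under the nats assumption the perplexity increase is just under 20%, and log2(3) ≈ 1.585 is the origin of the "1.58-bit" label.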
