BackSlash: Rate Constrained Optimized Training of Large Language Models

Jun Wu, Jiangtao Wen, Yuxing Han

TL;DR

BackSlash tackles training-time efficiency for large language models by integrating rate-distortion optimization into the training objective. It models LLM parameters with a generalized Gaussian distribution and estimates their coding cost with a discretized generalized Gaussian rate (DGGR), entropy-codes them with exp-Golomb codes, and uses a Lagrange multiplier $\lambda$ to balance distortion against rate. The approach reduces memory usage by 60%–90% with minimal accuracy loss, and improves pruning robustness and edge-deployment viability across multiple architectures. BackSlash generalizes well across models and tasks, aided by adaptive distribution shaping and robust entropy coding, indicating a practical path to training smaller, deployment-ready foundation models.
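As a rough illustration of the entropy-coding step mentioned above, the following is a minimal sketch of order-0 exp-Golomb coding applied to quantized (integer) parameters. The function names and the zigzag-style signed-to-unsigned mapping are assumptions made for illustration; the summary does not specify which exp-Golomb order or sign mapping the paper uses.

```python
def signed_to_unsigned(v: int) -> int:
    """Map a signed integer to a non-negative one (assumed mapping):
    0 -> 0, 1 -> 1, -1 -> 2, 2 -> 3, -2 -> 4, ..."""
    return 2 * v - 1 if v > 0 else -2 * v


def exp_golomb_encode(n: int) -> str:
    """Order-0 exp-Golomb codeword for a non-negative integer, as a bit string."""
    if n < 0:
        raise ValueError("exp-Golomb codes are defined for non-negative integers")
    binary = bin(n + 1)[2:]                      # binary form of n + 1
    return "0" * (len(binary) - 1) + binary      # prefix of len - 1 zeros


def exp_golomb_decode(bits: str) -> int:
    """Decode a single order-0 exp-Golomb codeword."""
    zeros = len(bits) - len(bits.lstrip("0"))    # count leading zeros
    return int(bits[zeros:2 * zeros + 1], 2) - 1


def average_code_length(values) -> float:
    """Mean number of bits per value after the signed mapping."""
    codes = [exp_golomb_encode(signed_to_unsigned(v)) for v in values]
    return sum(len(c) for c in codes) / len(codes)
```

Small magnitudes get short codewords (0 costs one bit, ±1 costs three), so a parameter distribution concentrated near zero yields a short average code length.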

Abstract

The rapid advancement of large language models (LLMs) has driven extensive research into parameter compression after training, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (BackSlash), a novel training-time compression approach based on rate-distortion optimization (RDO). BackSlash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments across various architectures and tasks demonstrate that BackSlash can reduce memory usage by 60%–90% without accuracy loss and provides significant compression gains compared to compression after training. Moreover, BackSlash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80% pruning rates), and enables network simplification for accelerated inference on edge devices.
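The accuracy-versus-complexity trade-off can be pictured as a Lagrangian cost of the form distortion plus a multiplier times rate. The sketch below is only an illustration under stated assumptions: the rate is approximated as the negative log2-probability of each weight under a generalized Gaussian density with scale `alpha` and shape `beta`, multiplied by a quantization step `step`. These names, and this particular rate proxy, are hypothetical and are not taken from the paper's DGGR definition.

```python
import math


def ggd_rate_bits(weights, alpha: float, beta: float, step: float) -> float:
    """Approximate coding cost in bits under a generalized Gaussian model.

    Assumed proxy: P(w) ~= f(w) * step, where
    f(w) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|w| / alpha) ** beta).
    """
    log_norm = math.log(beta / (2.0 * alpha * math.gamma(1.0 / beta)))
    nats = 0.0
    for w in weights:
        log_density = log_norm - (abs(w) / alpha) ** beta
        nats -= log_density + math.log(step)     # -ln(f(w) * step)
    return nats / math.log(2.0)                  # convert nats to bits


def rd_cost(distortion: float, weights, lam: float,
            alpha: float, beta: float, step: float) -> float:
    """Lagrangian cost J = D + lam * R (illustrative form, not the paper's exact objective)."""
    return distortion + lam * ggd_rate_bits(weights, alpha, beta, step)
```

With `lam = 0` the cost reduces to the task loss alone; increasing `lam` penalizes large-magnitude weights more heavily, which is the lever the abstract describes for trading accuracy against model size.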

Paper Structure

This paper contains 15 sections, 8 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: Parameter distributions of different LLMs fitted with the generalized Gaussian distribution (GGD) and the Gaussian distribution (GD). GGD fits the boundaries of the parameter distributions better than GD does.
  • Figure 2: Changes in RD cost and rate during training with different Lagrange multipliers ($\lambda$).
  • Figure 3: Changes in the shape parameter during training with different Lagrange multipliers ($\lambda$).
  • Figure 4: Impact of Lagrange multipliers on the average code length of various encoding algorithms.
  • Figure 5: Impact of Lagrange multipliers on accuracy on the test and training datasets.
  • ...and 3 more figures