Reversing Large Language Models for Efficient Training and Fine-Tuning

Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber

TL;DR

The paper tackles the memory bottleneck in training and fine-tuning large language models by introducing memory-efficient, reversible architectures inspired by energy-conserving hyperbolic dynamics. It presents three reversible dynamics (Midpoint, Leapfrog, Hamiltonian) that allow activations to be reconstructed exactly during backpropagation, which reduces activation memory at a modest compute overhead and permits larger batch sizes. The authors demonstrate both training reversible models from scratch and retrofitting pre-trained non-reversible models, reporting competitive or improved performance on various benchmarks together with substantial memory and throughput gains. The work lays out a scalable path toward efficient, long-context LLMs and offers a concrete retrofit procedure for converting existing models to reversible training regimes.

Abstract

Large Language Models (LLMs) are known for their expensive and time-consuming training. For this reason, LLMs are often fine-tuned for a specific task, starting from the weights of a pre-trained LLM that serves as a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Unlike standard baseline architectures, which store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, removing the need to store activations. This property drastically reduces memory consumption, so that larger batch sizes can be processed within the same available memory, which in turn improves throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, making our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, across several LLMs, building a scalable and efficient path towards reducing the memory and computational costs of both training from scratch and fine-tuning of LLMs.
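
To make the idea of retrieving hidden states from time-reversible dynamics concrete, here is a minimal sketch of an explicit midpoint recursion and its inverse. It is an illustration under stated assumptions, not the authors' implementation: the update form x_{j+1} = x_{j-1} + 2h f(x_j), the step size `h`, and the toy `layer` function (a stand-in for a transformer block) are all hypothetical choices made for this example.

```python
import math

def layer(x, w):
    """Hypothetical stand-in for a transformer block: elementwise tanh(w * x)."""
    return [math.tanh(w * xi) for xi in x]

def midpoint_forward(x_prev, x_curr, weights, h=0.1):
    """Apply x_{j+1} = x_{j-1} + 2h f(x_j, w_j), keeping only the last two states."""
    for w in weights:
        x_next = [p + 2.0 * h * f for p, f in zip(x_prev, layer(x_curr, w))]
        x_prev, x_curr = x_curr, x_next
    return x_prev, x_curr

def midpoint_backward(x_prev, x_curr, weights, h=0.1):
    """Invert the recursion: x_{j-1} = x_{j+1} - 2h f(x_j, w_j)."""
    for w in reversed(weights):
        x_older = [c - 2.0 * h * f for c, f in zip(x_curr, layer(x_prev, w))]
        x_prev, x_curr = x_older, x_prev
    return x_prev, x_curr

# Assumed toy inputs: two initial states and one scalar weight per layer.
x0 = [0.5, -1.0, 2.0]
x1 = [0.4, -0.9, 1.8]
weights = [0.3, -0.7, 1.1, 0.5]

last_two = midpoint_forward(x0, x1, weights)
r0, r1 = midpoint_backward(last_two[0], last_two[1], weights)

# The initial states are recovered from the final two states alone,
# up to floating-point rounding.
max_err = max(abs(u - v) for u, v in zip(r0 + r1, x0 + x1))
```

Only the final pair of states is kept after the forward pass; the backward sweep recomputes each earlier state on demand, which is why the activation storage in this sketch does not grow with the number of layers. In finite precision the reconstruction is exact only up to rounding error.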

Paper Structure

This paper contains 26 sections, 37 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Reversible architectures: (a) explicit midpoint update and (b) leapfrog update.
  • Figure 1: Training and validation loss curves of GPT-2 using baseline, midpoint, and leapfrog architectures.
  • Figure 2: Training GPU memory usage vs. network depth for baseline and reversible (Midpoint) models. The baseline memory grows linearly and fails beyond 12 layers, while the reversible model remains constant.
  • Figure 3: Next token prediction cross-entropy loss during the conversion of TinyLlama-v1.0 to a reversible Midpoint architecture. The reversible model (Midpoint) closely matches the baseline in next-token prediction, demonstrating successful functional alignment.