GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable

Jianqiao Wangni

TL;DR

This work asks whether we can customize foundation models for individual users or tasks without prohibitive retraining. It introduces GPT Carry-On, which freezes a base LLM and trains a lightweight carry-on adaptor on the final-layer embeddings, with a bridge that compresses embeddings to a trainable module on separate hardware. The approach leverages gating and an alpha scaling parameter to balance base knowledge with task-specific adjustments, and it frames customization within VC-dimension and gradient-boosting perspectives to analyze generalization. Empirically, carry-on training shows faster convergence and can yield measurable improvements in tasks like math reasoning, using minimal additional parameters (as low as ~1 MB) and modest compute, making personalized LLM customization scalable and affordable for deployment on inference-oriented hardware.
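The balance between base knowledge and task-specific adjustment can be illustrated with a small sketch. This summary does not give the exact formula, so the residual form below (base embedding plus an alpha-scaled, sigmoid-gated correction) is an assumption, and the names `carry_on_merge`, `gate_logits`, and `carry_out` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used here as an illustrative gate."""
    return 1.0 / (1.0 + math.exp(-x))


def carry_on_merge(base_hidden, carry_out, gate_logits, alpha):
    """Blend a frozen base embedding with a carry-on correction.

    Assumed form (not taken from the paper):
        h = base + alpha * sigmoid(gate) * carry_out
    """
    return [
        b + alpha * sigmoid(g) * c
        for b, c, g in zip(base_hidden, carry_out, gate_logits)
    ]


# Toy final-layer embedding from the frozen base model.
base_hidden = [0.5, -1.0, 2.0]
# Toy output of the trainable carry-on module for the same token.
carry_out = [0.2, 0.4, -0.6]
# Gate logits of zero give a gate value of exactly 0.5.
gate_logits = [0.0, 0.0, 0.0]

# alpha = 0 switches the carry-on off, so the base output is returned as-is.
unchanged = carry_on_merge(base_hidden, carry_out, gate_logits, alpha=0.0)
# alpha = 1 applies the gated correction in full.
adjusted = carry_on_merge(base_hidden, carry_out, gate_logits, alpha=1.0)
```

Under this assumed form, setting `alpha` to zero recovers the base model exactly, which is one simple way a scaling parameter could protect base knowledge while the carry-on is still untrained.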

Abstract

Modern large language foundation models (LLMs) have entered the daily lives of millions of users. We ask a natural question: is it possible to customize an LLM for every user or every task? From a systems and industrial-economics perspective, general continued training or fine-tuning still requires substantial computation and memory on training GPU nodes, whereas most deployed inference nodes, possibly with lower-end GPUs, are configured to make the forward pass as fast as possible. We propose a framework that takes full advantage of existing LLMs and online serving systems. We train an additional branch of transformer blocks on the final-layer embeddings of pretrained LLMs, which serve as the base; a carry-on module then merges the base models to compose a customized LLM. We can mix multiple layers, or multiple LLMs specialized in different domains such as chat, coding, and math, to form a new mixture of LLMs that best fits a new task. Because the base model does not need to update its parameters, we can outsource most of the training computation to inference nodes and train only a lightweight carry-on on training nodes, consuming less than 1 GB of GPU memory to train a 100M-parameter carry-on layer on a 30B LLM. We tested open-source Qwen and DeepSeek models for continued pretraining and observed faster loss convergence. We also use the framework to improve math problem solving with extremely small computation and model size: 1000 chain-of-thought data samples and a two-layer carry-on with as little as 1 MB of parameters. The results are promising.
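The split between inference nodes and training nodes can be sketched as a bridge that compresses the frozen base model's embeddings before they are handed to the carry-on trainer. The figure captions mention a quantization-bits setting (`qt_bits`, where 0 means no quantization), but this summary does not describe the scheme itself, so the uniform min-max quantizer below is an assumption and the function names are hypothetical.

```python
def quantize(embedding, qt_bits):
    """Uniformly quantize floats to integer codes.

    qt_bits = 0 means no quantization: the floats pass through unchanged.
    Returns (codes, params), where params holds (offset, scale) or None.
    """
    if qt_bits == 0:
        return list(embedding), None
    lo, hi = min(embedding), max(embedding)
    levels = (1 << qt_bits) - 1  # largest integer code
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in embedding]
    return codes, (lo, scale)


def dequantize(codes, params):
    """Recover approximate floats on the training node."""
    if params is None:
        return list(codes)
    lo, scale = params
    return [lo + c * scale for c in codes]


# Toy final-layer embedding produced by the frozen base on an inference node.
embedding = [-1.0, -0.25, 0.0, 0.5, 1.0]

# Inference node: compress to 4-bit codes before sending over the bridge.
codes, params = quantize(embedding, qt_bits=4)

# Training node: reconstruct the embedding and feed it to the carry-on.
restored = dequantize(codes, params)
max_error = max(abs(a - b) for a, b in zip(embedding, restored))
```

With this kind of quantizer the reconstruction error per value is bounded by half a quantization step, so fewer bits trade bandwidth and memory against embedding fidelity.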

Paper Structure

This paper contains 10 sections, 15 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The architectural design of the transformer-carry-on
  • Figure 2: Optimal carry-on model search of customization pipeline.
  • Figure 3: Convergence comparison of carry-on training with and without training the base LLM.
  • Figure 4: Training convergence with different carry-on layer sizes (quantization bits = 4, shallow shortcut layer depth = 0).
  • Figure 5: Training convergence with different quantization bits (qt_bits); qt_bits = 0 means the floating-point embedding is not quantized (shortcut layer depth = 0, hidden size = 256, carry-on layers = 3).
  • ...and 2 more figures