Designing Large Foundation Models for Efficient Training and Inference: A Survey

Dong Liu; Yanxuan Yu; Yite Wang; Jing Wu; Zhongwei Wan; Sina Alinejad; Benjamin Lengerich; Ying Nian Wu

TL;DR

This survey assesses how large foundation models can be trained and deployed more efficiently by exploring three intertwined paths: model design, system design, and model-system co-design. It covers quantization (bit-width based and method-based), knowledge distillation (soft and hard), and pruning (unstructured and structured) as core model-level strategies, and KV cache design, token reduction, sparsity-aware optimization, and multi-modal fusion as system-level techniques. It then examines how MoE, mixed precision training, efficient pretraining, and efficient fine-tuning enable scalable, cost-effective development, culminating in a unified framework for model-system co-design. The work emphasizes practical improvements for accessibility and affordability of foundation models across NLP, vision, and multimodal applications, with broad implications for real-world deployment and research efficiency.

Abstract

This paper surveys modern techniques for efficient training and inference of foundation models and presents them from two perspectives: model design and system design. The two perspectives optimize LLM training and inference in different ways to save computational resources, making LLMs more efficient, affordable, and accessible. The paper list repository is available at https://github.com/NoakLiu/Efficient-Foundation-Models-Survey.

Paper Structure (66 sections, 54 equations, 2 figures, 4 tables)

Introduction
Model Design for Efficient Foundation Models
Quantization
Bit-width Based Quantization
Method-Based Quantization
Knowledge Distillation
Soft Knowledge Distillation
Hard Knowledge Distillation
Pruning
Unstructured Pruning
Saliency-based Pruning.
Optimization-based Pruning.
Structured Pruning
Saliency-based Pruning.
Optimization-based Pruning.
...and 51 more sections

Figures (2)

Figure 1: Efficient Foundation Models Overview
Figure 3: KV Cache Design Pipeline
