Investigating the Impact of Model Complexity in Large Language Models

Jing Luo, Huiyuan Wang, Weiran Huang

TL;DR

This paper models autoregressive LLMs with Hidden Markov Models (HMMs) and investigates the relationship between model complexity and generalization capability in downstream tasks under head tuning, a popular tuning paradigm.

Abstract

Large Language Models (LLMs) based on the pre-training and fine-tuning paradigm have become pivotal in solving natural language processing tasks, consistently achieving state-of-the-art performance. Nevertheless, a theoretical understanding of how model complexity influences fine-tuning performance remains challenging and has not yet been well explored. In this paper, we focus on autoregressive LLMs and propose to employ Hidden Markov Models (HMMs) to model them. Based on the HMM modeling, we investigate the relationship between model complexity and the generalization capability in downstream tasks. Specifically, we consider a popular tuning paradigm for downstream tasks, head tuning, where all pre-trained parameters are frozen and only individual heads are trained atop pre-trained LLMs. Our theoretical analysis reveals that the risk initially increases and then decreases with rising model complexity, showcasing a "double descent" phenomenon. In this case, the initial "descent" is degenerate, signifying that the "sweet spot" where bias and variance are balanced occurs when the model size is zero. Reaching this conclusion involves several challenges, primarily effectively modeling autoregressive LLMs and downstream tasks and conducting a comprehensive risk analysis for multivariate regression. Our analysis is supported by experiments on data generated from HMMs, which align with our theoretical insights.

Paper Structure

This paper contains 25 sections, 5 theorems, 56 equations, 3 figures.

Key Result

Lemma 1

For the least squares estimator $\widehat{B}$, the next-word prediction risk has bias defined as $B_X(\widehat{B};B)$ and variance defined as $V_X(\widehat{B};B)$.

Figures (3)

  • Figure 1: Risk curves of the least squares estimator $\widehat{B}$ as a function of the model size $p$, which is also the dimensionality of the vector $x$. Here $p$ varies from 50 to 150, with $n=100$, $d=50$, and varying levels of noise variance $\Sigma_{\epsilon}$. The continuous lines represent the analytical predictions derived from Definition 2.
  • Figure 2: Risk curves of the least squares estimator $\widehat{B}$ as a function of the model size $p$, which is also the dimensionality of the vector $x$. Here $p$ varies from 50 to 350, with $\Sigma_{\epsilon}=0.5$, $d=50$, and varying numbers of data samples $n$. The continuous lines represent the analytical predictions derived from Definition 2.
  • Figure 3: The cross-entropy test loss curve obtained by prediction through head tuning on a Transformer model. In this experiment, both the training and test data are generated using an HMM.
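
The qualitative shape described for Figures 1 and 2 (risk rising as $p$ approaches $n$, then falling beyond it) can be reproduced in a small simulation. The sketch below is only an illustration under assumptions that are not taken from the paper: isotropic Gaussian features, a well-specified linear model $Y = XB + E$ with i.i.d. Gaussian noise, and a hypothetical ground truth $B$ with i.i.d. $N(0, 1/p)$ entries. The function name `estimate_risk` is also hypothetical. It uses the minimum-norm least squares solution via the pseudoinverse, which coincides with ordinary least squares when $p < n$.

```python
import numpy as np


def estimate_risk(n, p, d, noise_var, rng, trials=20):
    """Monte Carlo estimate of the excess risk of min-norm least squares.

    Assumed setup (not from the paper): rows of X are i.i.d. N(0, I_p),
    Y = X @ B + E, and E has i.i.d. N(0, noise_var) entries.
    """
    risks = []
    for _ in range(trials):
        # Hypothetical ground truth: ||B||_F^2 is roughly d for every p.
        B = rng.normal(scale=1.0 / np.sqrt(p), size=(p, d))
        X = rng.normal(size=(n, p))
        E = rng.normal(scale=np.sqrt(noise_var), size=(n, d))
        Y = X @ B + E

        # Minimum-norm least squares estimator (equals OLS when p < n).
        B_hat = np.linalg.pinv(X) @ Y

        # For isotropic x, E_x ||(B_hat - B)^T x||^2 = ||B_hat - B||_F^2,
        # so the excess risk can be computed without a test set.
        risks.append(np.sum((B_hat - B) ** 2))
    return float(np.mean(risks))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Same n and d as the Figure 1 caption; the data model is an assumption.
    for p in (50, 75, 95, 105, 125, 150):
        risk = estimate_risk(n=100, p=p, d=50, noise_var=0.5, rng=rng)
        print(f"p={p:4d}  estimated excess risk={risk:.2f}")
```

Under these assumptions the estimated risk should peak near the interpolation threshold $p = n$ and decrease on either side of it. The exact values depend on the assumed data model and are not expected to match the paper's curves.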

Theorems & Definitions (8)

  • Lemma 1
  • Theorem 1
  • Definition 1
  • Definition 2
  • Theorem 2
  • Definition 3
  • Theorem 3
  • Lemma 2