A Theoretical Survey on Foundation Models

Shi Fu, Yuzhu Chen, Yingjie Wang, Dacheng Tao

TL;DR

This survey reviews interpretable methods that unveil the mechanisms of foundation models (FMs) in an accurate, comprehensive, heuristic, and resource-light way and that have been successfully applied to FMs, and it identifies the next frontier research directions for FMs.

Abstract

Understanding the inner mechanisms of black-box foundation models (FMs) is essential yet challenging in artificial intelligence and its applications. Over the last decade, research has focused largely on their explainability, leading to post-hoc explainable methods that rationalize specific decisions already made by black-box FMs. However, these explainable methods have limitations in faithfulness and resource requirements. Consequently, a new class of interpretable methods should be considered to unveil the underlying mechanisms of FMs in an accurate, comprehensive, heuristic, and resource-light way. This survey reviews interpretable methods that comply with these principles and have been successfully applied to FMs. These methods are deeply rooted in machine learning theory, covering the analysis of generalization performance, expressive capability, and dynamic behavior. They provide a thorough interpretation of the entire workflow of FMs, from inference capability and training dynamics to ethical implications. Drawing on these interpretations, the survey concludes by identifying the next frontier research directions for FMs.

Paper Structure

This paper contains 48 sections, 23 theorems, 77 equations, 16 figures, 2 tables.

Key Result

Theorem 3.5

(From Covering to Rademacher Complexity via Chaining shalev2014understanding) For any $A \subseteq \mathbb{R}^m$ and vectors $\boldsymbol{a}, \boldsymbol{a'} \in \mathbb{R}^m$, let $c = \min_{\boldsymbol{a'}} \max_{\boldsymbol{a} \in A} \|\boldsymbol{a} - \boldsymbol{a'}\|$. Then, for any integer $M$ …
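Theorem 3.5 bounds the empirical Rademacher complexity of a set $A$ through its covering numbers. For small $m$ the quantity being bounded can be computed exactly by enumerating all $2^m$ sign vectors, which makes a useful sanity check. The sketch below assumes the normalization $\mathcal{R}(A) = \frac{1}{m}\mathbb{E}_{\boldsymbol{\sigma}}[\max_{\boldsymbol{a} \in A} \langle \boldsymbol{\sigma}, \boldsymbol{a} \rangle]$ used in shalev2014understanding (Definition 3.3 itself is not reproduced here). For comparison it uses Massart's finite-class lemma, a simpler bound from the same book and not Theorem 3.5 itself. It also reports $\max_{\boldsymbol{a}} \|\boldsymbol{a} - \bar{\boldsymbol{a}}\|$ with $\bar{\boldsymbol{a}}$ the mean of $A$, which upper-bounds $c$ because $c$ minimizes over every center $\boldsymbol{a'}$. All function names are hypothetical.

```python
import itertools
import math

import numpy as np


def empirical_rademacher(A: np.ndarray) -> float:
    """Exact empirical Rademacher complexity of a finite set A (rows are vectors in R^m).

    Assumes R(A) = (1/m) * E_sigma[max_{a in A} <sigma, a>], with sigma uniform
    over {-1, +1}^m. Enumerates all 2^m sign vectors, so only practical for small m.
    """
    A = np.asarray(A, dtype=float)
    m = A.shape[1]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        sigma = np.array(signs)
        total += float(np.max(A @ sigma))  # supremum over the finite set A
    return total / (2 ** m) / m


def centroid_radius(A: np.ndarray) -> float:
    """max_{a in A} ||a - a_bar|| with a_bar the mean of A; an upper bound on c."""
    A = np.asarray(A, dtype=float)
    a_bar = A.mean(axis=0)
    return float(np.max(np.linalg.norm(A - a_bar, axis=1)))


def massart_bound(A: np.ndarray) -> float:
    """Massart's lemma: R(A) <= max_a ||a - a_bar|| * sqrt(2 log N) / m, N = |A|."""
    A = np.asarray(A, dtype=float)
    n_points, m = A.shape
    return centroid_radius(A) * math.sqrt(2.0 * math.log(n_points)) / m


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 10))  # 5 hypothetical loss vectors on m = 10 samples
    r = empirical_rademacher(A)
    print(f"R(A) = {r:.4f}, Massart bound = {massart_bound(A):.4f}")
```

Exact enumeration is chosen over Monte Carlo sampling so that the result is deterministic; chaining bounds such as Theorem 3.5 matter precisely when $A$ is infinite or $m$ is too large for this brute-force approach.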

Figures (16)

  • Figure 1: An overview of interpretability, interpretable method and interpretations tailored for FMs.
  • Figure 2: Heatmap of attention weights for a negative movie review. On the left is the model's actual attention pattern; on the right is a set of adversarially constructed attention weights. Despite their significant differences, both patterns yield effectively the same prediction (0.01).
  • Figure 3: The interpretable methods: generalization analysis, approximation capability analysis, and dynamic behavior analysis. The figures depict the empirical risk $\mathcal{R}_n$ (dotted line) and the expected risk $\mathcal{R}$ (solid line) over a measurable function space, where $f_{initial}$, $f_\mathbf{z}$, $f_H$, and $f^*$ denote the initial model, the learned model, the optimal function over the hypothesis space, and the optimal function over the measurable function space, respectively. The generalization error (red area, formally $|\mathcal{R}(f) - \mathcal{R}_n(f)|$) offers insight into a model's performance on unseen data. The approximation error (red solid line, formally $|\mathcal{R}(f_H) - \mathcal{R}(f^*)|$), a manifestation of expressive power, measures the discrepancy between the true underlying function and the optimal function within the hypothesis space. Dynamic behavior here refers to the trajectory of an FM's parameters from the initial point $f_{initial}$ to the converged solution $f_\mathbf{z}$ under methods such as gradient descent or RLHF, which can provide valuable insight into the FM's characteristics and potential issues.
  • Figure 4: Different Responses by GPT-3 to Identical Questions with Typos liu2023trustworthy.
  • Figure 5: Mismatch Between Context Examples and Query Leading to Misinterpretations and Erroneous Outputs
  • ...and 11 more figures
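The generalization error $|\mathcal{R}(f) - \mathcal{R}_n(f)|$ from the Figure 3 caption can be illustrated on a toy problem whose expected risk has a closed form. The sketch below is a hypothetical setup, not one taken from the survey: inputs $x \sim \mathcal{N}(0,1)$, labels $y = x + \varepsilon$ with independent noise $\varepsilon \sim \mathcal{N}(0, s^2)$, a linear predictor $f(x) = wx$, and squared loss. Under these assumptions $\mathcal{R}(f) = \mathbb{E}[((w-1)x - \varepsilon)^2] = (w-1)^2 + s^2$, while $\mathcal{R}_n(f)$ is the sample average of the loss.

```python
import numpy as np


def expected_risk(w: float, noise_std: float) -> float:
    """Closed-form R(f) for f(x) = w*x under the toy model y = x + eps, squared loss."""
    return (w - 1.0) ** 2 + noise_std ** 2


def empirical_risk(w: float, x: np.ndarray, y: np.ndarray) -> float:
    """R_n(f): average squared loss of f(x) = w*x over the n observed samples."""
    return float(np.mean((w * x - y) ** 2))


def generalization_gap(w: float, n: int, noise_std: float, seed: int = 0) -> float:
    """|R(f) - R_n(f)| for a fixed (not data-dependent) predictor f(x) = w*x."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(scale=noise_std, size=n)
    return abs(expected_risk(w, noise_std) - empirical_risk(w, x, y))


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, generalization_gap(w=0.5, n=n, noise_std=1.0))
```

Because $f$ here is fixed before the sample is drawn, the gap shrinks at roughly the $1/\sqrt{n}$ rate of the law of large numbers. For a learned model $f_\mathbf{z}$ that depends on the same sample, the gap must instead be controlled uniformly over the hypothesis space or through the learning algorithm, which is where tools such as Rademacher complexity (Definition 3.3) and uniform stability (Definition 3.6) come in.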

Theorems & Definitions (32)

  • Remark 3.1
  • Definition 3.2
  • Definition 3.3: Empirical Rademacher complexity
  • Definition 3.4
  • Theorem 3.5
  • Definition 3.6: Uniform stability bousquet2002stability
  • Definition 3.7: Error stability bousquet2002stability
  • Theorem 3.8: Exponential generalization bound in terms of uniform stability
  • Theorem 3.9
  • Theorem 3.10
  • ...and 22 more