Training Dynamics of Transformers to Recognize Word Co-occurrence via Gradient Flow Analysis

Hongru Yang, Bhavya Kailkhura, Zhangyang Wang, Yingbin Liang

TL;DR

This work studies the dynamics of training a shallow transformer on a task of recognizing co-occurrence of two designated words, and proves a novel property of the gradient flow, termed automatic balancing of gradients, which enables the loss values of different samples to decrease almost at the same rate and further facilitates the proof of near minimum training loss.

Abstract

Understanding the training dynamics of transformers is important for explaining the impressive capabilities of large language models. In this work, we study the dynamics of training a shallow transformer on the task of recognizing the co-occurrence of two designated words. Prior work on the training dynamics of transformers commonly adopts several simplifications, such as weight reparameterization, attention linearization, special initialization, and the lazy regime. In contrast, we analyze the gradient flow dynamics of simultaneously training three attention matrices and a linear MLP layer from random initialization, and provide a framework for analyzing such dynamics via a coupled dynamical system. We establish near-minimum loss and characterize the attention model after training. We discover that gradient flow serves as an inherent mechanism that naturally divides the training process into two phases. In Phase 1, the linear MLP quickly aligns with the two target signals for correct classification, whereas the softmax attention remains almost unchanged. In Phase 2, the attention matrices and the MLP evolve jointly to enlarge the classification margin and reduce the loss to a near-minimum value. Technically, we prove a novel property of the gradient flow, termed automatic balancing of gradients, which enables the loss values of different samples to decrease at almost the same rate and further facilitates the proof of near-minimum training loss. We also conduct experiments to verify our theoretical results.
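To make the setting concrete, the following is a minimal sketch of the kind of model and task the abstract describes: a single softmax attention layer with three trainable matrices followed by a linear layer, scored on whether two designated words both appear in a sequence. Everything specific here is an assumption for illustration and is not taken from the paper: the one-hot embeddings, the token ids in `TARGETS`, the `1/sqrt(m)` attention scaling, the sum pooling, the logistic loss, and all function and parameter names.

```python
import numpy as np

TARGETS = (0, 1)  # hypothetical ids of the two designated words


def cooccurrence_label(tokens):
    """Return +1 if both designated words appear in the sequence, else -1."""
    present = set(tokens)
    return 1 if all(t in present for t in TARGETS) else -1


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d, m, scale=0.01):
    """Small random initialization (assumed scale) of all trainable weights."""
    return {
        "W_Q": rng.normal(0.0, scale, (m, d)),
        "W_K": rng.normal(0.0, scale, (m, d)),
        "W_V": rng.normal(0.0, scale, (m, d)),
        "W_mlp": rng.normal(0.0, scale, (m,)),  # linear "MLP" readout
    }


def attention_weights(params, tokens, d):
    """Row-stochastic (L, L) softmax attention matrix for one sequence."""
    X = np.eye(d)[list(tokens)]  # (L, d) one-hot embeddings (assumption)
    Q = X @ params["W_Q"].T      # (L, m)
    K = X @ params["W_K"].T      # (L, m)
    return softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=-1)


def forward(params, tokens, d):
    """Scalar classification score: attention output, sum-pooled, then linear."""
    X = np.eye(d)[list(tokens)]
    V = X @ params["W_V"].T                      # (L, m)
    H = attention_weights(params, tokens, d) @ V  # (L, m)
    return float(H.sum(axis=0) @ params["W_mlp"])


def logistic_loss(score, label):
    """Logistic loss log(1 + exp(-y * f(x))) for a label in {-1, +1}."""
    return float(np.log1p(np.exp(-label * score)))
```

With a small random initialization the attention weights are close to uniform and the score is close to zero, so the loss starts near log 2 for every sample. This is the kind of starting point from which a two-phase picture (linear layer moves first, attention moves later) can be observed numerically, although the sketch itself does not train anything.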

Paper Structure

This paper contains 37 sections, 70 theorems, 284 equations, 1 figure.

Key Result

Theorem 3.1

With probability at least $1 - \delta$ over the randomness of weight initialization, there exists a time $T_1 = \widetilde{O}(1/m)$ such that

Figures (1)

  • Figure 1: Synthetic experiments with illustration of two training phases.

Theorems & Definitions (138)

  • Definition 2.1: Data distribution
  • Remark 2.2
  • Theorem 3.1: Phase 1
  • Theorem 3.2: Phase 2
  • Theorem 3.3: Near Minimum Training Loss and Attention
  • Lemma 4.1: Same as \ref{lemma: first_step_signal}
  • Lemma 4.2: Same as \ref{lemma: initial_common_token_per_neuron_update}
  • Lemma 4.3: Abbreviated from \ref{thm: common_token_gradient_stage_1}
  • Proof: Proof Intuition of \ref{lemma: gradient_balancing_condition_phase1_main_text}
  • Lemma 4.4: Same as \ref{lemma: I_2_I_3_gradient_ratio_bound}
  • ...and 128 more