Exploring Mode Connectivity for Pre-trained Language Models

Yujia Qin, Cheng Qian, Jing Yi, Weize Chen, Yankai Lin, Xu Han, Zhiyuan Liu, Maosong Sun, Jie Zhou

TL;DR

This work analyzes how the minima reached during PLM downstream adaptation are connected in parameter space, using mode connectivity to assess linear and non-linear paths between them. It shows that initialization and pre-training strongly influence connectivity: fine-tuning typically yields better linear connectivity than delta tuning, and pre-training pulls the representations of different tasks closer together. The study also shows that non-linear Bézier curves can connect minima when linear paths fail, and it examines how task knowledge and memorization evolve along the connecting paths. These findings have practical implications for ensemble methods, cross-task transferability, and understanding the mechanisms of PLM downstream adaptation.

Abstract

Recent years have witnessed the widespread application of pre-trained language models (PLMs) in NLP. From the perspective of parameter space, PLMs provide a generic initialization from which high-performance minima can be found. Although many works have studied how to adapt PLMs to high-performance minima effectively and efficiently, little is known about how the various minima reached under different adaptation configurations are connected. In this paper, we investigate the geometric connections between different minima through the lens of mode connectivity, which measures whether two minima can be connected by a low-loss path. We conduct empirical analyses to investigate three questions: (1) How do hyperparameters, specific tuning methods, and training data affect a PLM's mode connectivity? (2) How does mode connectivity change during pre-training? (3) How does a PLM's task knowledge change along the path connecting two minima? In general, exploring the mode connectivity of PLMs contributes to understanding the geometric connection of different minima, which may help us fathom the inner workings of PLM downstream adaptation.
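
To make the notion of a low-loss path concrete, the following is a minimal sketch of a linear mode connectivity check, not the authors' implementation. It interpolates between two parameter vectors and records the loss at each point. The names `interpolate`, `loss_along_line`, `loss_barrier`, and `toy_loss` are hypothetical, the barrier definition (peak loss minus the higher endpoint loss) is one of several in use and is an assumption here, and the two-dimensional toy loss stands in for a real PLM evaluated on task data.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def interpolate(theta_a: Vector, theta_b: Vector, alpha: float) -> List[float]:
    """Point on the straight line: (1 - alpha) * theta_a + alpha * theta_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_along_line(
    loss_fn: Callable[[Vector], float],
    theta_a: Vector,
    theta_b: Vector,
    num_points: int = 11,
) -> List[Tuple[float, float]]:
    """Evaluate loss_fn at evenly spaced alphas in [0, 1], endpoints included."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [(alpha, loss_fn(interpolate(theta_a, theta_b, alpha))) for alpha in alphas]


def loss_barrier(path: List[Tuple[float, float]]) -> float:
    """Assumed definition: highest loss on the path minus the higher endpoint loss."""
    losses = [loss for _, loss in path]
    return max(losses) - max(losses[0], losses[-1])


def toy_loss(theta: Vector) -> float:
    """Hypothetical double-well loss with minima at (-1, 0) and (1, 0)."""
    x, y = theta
    return (x * x - 1.0) ** 2 + y * y


# alpha = 0 is the first minimum, alpha = 1 is the second.
path = loss_along_line(toy_loss, [-1.0, 0.0], [1.0, 0.0])
print(loss_barrier(path))
```

A barrier near zero would indicate that the two minima are linearly connected; in this toy landscape the straight line crosses a ridge, so the barrier is clearly positive. For a real model, `theta_a` and `theta_b` would be the flattened weights of two adapted checkpoints and `loss_fn` would load the interpolated weights and evaluate them on held-out data.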

Paper Structure

This paper contains 40 sections, 8 equations, 25 figures, and 3 tables.

Figures (25)

  • Figure 1: The performance of linear interpolations between two minima trained with different training data orders.
  • Figure 2: The performance of linear interpolations between two minima trained from different initializations.
  • Figure 3: The performance of interpolations along a non-linear path connecting two minima, which are trained with adapter tuning from different initializations.
  • Figure 4: Linear mode connectivity analysis for two minima trained with in-distribution MNLI data. The results on ReCoRD are left in \ref{sec:additional_overlap}.
  • Figure 5: Linear mode connectivity for two minima fine-tuned on different data distributions of the same task. Left: $\alpha = 0$ / $\alpha = 1$ denotes the minimum of MNLI / ANLI. Right: $\alpha = 0$ / $\alpha = 1$ denotes the minimum of Rotten Tomatoes / Yelp Polarity.
  • ...and 20 more figures
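
The non-linear path mentioned in Figure 3 and the TL;DR can be illustrated with a quadratic Bézier curve. The sketch below is illustrative only: it assumes a single control point `theta_c` (the excerpt does not state the curve's degree), picks that control point by hand rather than training it to minimize loss along the curve, and uses a hypothetical ring-shaped toy loss in place of a real model. The names `bezier_point`, `max_loss_on_curve`, and `ring_loss` are likewise hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bezier_point(theta_a: Vector, theta_c: Vector, theta_b: Vector, t: float) -> List[float]:
    """Quadratic Bezier: (1-t)^2 * theta_a + 2t(1-t) * theta_c + t^2 * theta_b."""
    w_a, w_c, w_b = (1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2
    return [w_a * a + w_c * c + w_b * b for a, c, b in zip(theta_a, theta_c, theta_b)]


def max_loss_on_curve(
    loss_fn: Callable[[Vector], float],
    theta_a: Vector,
    theta_c: Vector,
    theta_b: Vector,
    num_points: int = 11,
) -> float:
    """Highest loss over evenly spaced t in [0, 1], endpoints included."""
    ts = [i / (num_points - 1) for i in range(num_points)]
    return max(loss_fn(bezier_point(theta_a, theta_c, theta_b, t)) for t in ts)


def ring_loss(theta: Vector) -> float:
    """Hypothetical loss that is zero on the unit circle and high at its center."""
    x, y = theta
    return (x * x + y * y - 1.0) ** 2


theta_a, theta_b = [-1.0, 0.0], [1.0, 0.0]

# A control point at the midpoint reduces the curve to the straight line.
linear_max = max_loss_on_curve(ring_loss, theta_a, [0.0, 0.0], theta_b)

# A hand-picked control point bends the path around the high-loss center.
curved_max = max_loss_on_curve(ring_loss, theta_a, [0.0, 2.0], theta_b)

print(linear_max, curved_max)
```

In this toy landscape the straight line passes through the high-loss center while the bent curve stays close to the low-loss ring, which mirrors the case where two minima are connected non-linearly but not linearly. In practice the control point would be a full set of model parameters found by optimization, which this sketch does not attempt.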