What Happens To BERT Embeddings During Fine-tuning?

Amil Merchant, Elahe Rahimtoroghi, Ellie Pavlick, Ian Tenney

TL;DR

The paper probes how BERT representations change during fine-tuning on dependency parsing, MNLI, and SQuAD, using edge probing, structural probing, Representational Similarity Analysis (RSA), and layer ablations. It finds no catastrophic forgetting of the linguistic features learned in pre-training. Changes are concentrated in the top layers, and their depth varies by task: parsing reconfigures most of the model, while MNLI and SQuAD change only the upper layers. RSA indicates that representations of in-domain inputs change the most, while those of out-of-domain sentences stay close to the pre-trained model, suggesting that the adaptation generalizes only weakly beyond the fine-tuning domain. Overall, fine-tuning is a conservative adaptation that preserves linguistic information while selectively reconfiguring the encoder, which points to both the strength and the potential limits of standard transfer procedures.

Abstract

While there has been much recent work studying how linguistic information is encoded in pre-trained sentence representations, comparatively little is understood about how these models change when adapted to solve downstream tasks. Using a suite of analysis techniques (probing classifiers, Representational Similarity Analysis, and model ablations), we investigate how fine-tuning affects the representations of the BERT model. We find that while fine-tuning necessarily makes significant changes, it does not lead to catastrophic forgetting of linguistic phenomena. We instead find that fine-tuning primarily affects the top layers of BERT, but with noteworthy variation across tasks. In particular, dependency parsing reconfigures most of the model, whereas SQuAD and MNLI appear to involve much shallower processing. Finally, we also find that fine-tuning has a weaker effect on representations of out-of-domain sentences, suggesting room for improvement in model generalization.
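As a rough illustration of the Representational Similarity Analysis mentioned above, the sketch below compares two sets of representations of the same stimuli (for example, one layer of the pre-trained model and the same layer after fine-tuning). It builds the pairwise similarity structure within each set and then correlates the two structures. The choice of cosine similarity and Pearson correlation, and the helper names `pairwise_similarities` and `rsa_score`, are assumptions made for this example, not details taken from the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarities(reps):
    """Similarities for every unordered pair of stimuli (upper triangle)."""
    n = len(reps)
    return [cosine(reps[i], reps[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rsa_score(reps_a, reps_b):
    """RSA similarity of two representation spaces over the same stimuli.

    reps_a[i] and reps_b[i] must represent the same stimulus; the two
    spaces may have different dimensionality.
    """
    if len(reps_a) != len(reps_b):
        raise ValueError("both spaces must cover the same stimuli")
    return pearson(pairwise_similarities(reps_a), pairwise_similarities(reps_b))


# Toy stimuli: a 90-degree rotation keeps all pairwise cosines unchanged,
# so the two spaces are judged identical even though the vectors differ.
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0], [1.0, 2.0]]
score = rsa_score(base, rotated)
```

Because only the similarity structure is compared, the score is insensitive to rotations or rescalings of either space, which is what makes it usable for comparing a layer before and after fine-tuning.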

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Comparison of the structural probe performance on BERT models before and after fine-tuning. The stability of the Spearman correlations for both the depth and distance probes suggests that the embeddings still retain significant information about the syntax of input sentences.
  • Figure 2: Comparison of the representations from BERT base and the various fine-tuned models, when tested on Wikipedia examples. The dependency probing model starts to diverge from BERT Base around layer 5, which matches the previous results from edge probing. For the MNLI and SQuAD fine-tuned models, the differences from the Base model mainly arise in the top layers of the network.
  • Figure 3: Effects of freezing an increasing number of layers during fine-tuning on performance, where different lines correspond to various tasks (we report the evaluation accuracy for MNLI, F1 score for SQuAD, and LAS for Dep). The point at -1 corresponds to no frozen components. The graph shows that only a few unfrozen layers are needed to improve task performance, supporting the shallow processing conclusion.
  • Figure 4: Effects of fine-tuning at earlier layers of BERT. We note that the MNLI evaluation accuracy and SQuAD F1 score approach the full model performance by layer 6, whereas the dependency parsing LAS seems to require more layers. This supports our hypothesis that the processing of dataset-specific features is shallow.
  • Figure 5: Comparison of the representations in the MNLI (left) and SQuAD (right) fine-tuned models and those of BERT Base, with the different lines corresponding to examples coming from various datasets. These graphs show that fine-tuning leads only to shallow changes, concentrated in the last few layers. Also, we see that fine-tuning has a much greater impact on the token representations of in-domain data.
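
The layer-freezing ablation described in the Figure 3 caption can be sketched as a rule that decides, per parameter, whether it stays fixed during fine-tuning. This is a minimal sketch under stated assumptions: the caption only says that -1 means nothing is frozen, so treating 0 as "embeddings only" and k as "embeddings plus the first k encoder layers" is a guess, as is the parameter naming scheme (`embeddings.…`, `encoder.layer.<i>.…`), which follows a common BERT implementation convention. The helper `is_frozen` is hypothetical.

```python
import re

# Assumed naming convention: encoder layer i has parameters whose names
# contain "encoder.layer.<i>." (optionally behind a model prefix).
_LAYER_RE = re.compile(r"(?:^|\.)encoder\.layer\.(\d+)\.")


def is_frozen(param_name, num_frozen):
    """Return True if the parameter should be kept fixed during fine-tuning.

    num_frozen = -1 : nothing is frozen (as stated in the caption)
    num_frozen =  0 : embeddings only (assumption)
    num_frozen =  k : embeddings plus encoder layers 0..k-1 (assumption)
    """
    if num_frozen < 0:
        return False
    if param_name.startswith("embeddings.") or ".embeddings." in param_name:
        return True
    match = _LAYER_RE.search(param_name)
    if match is None:
        # Anything outside the embeddings and encoder stack (for example
        # a task-specific output layer) stays trainable.
        return False
    return int(match.group(1)) < num_frozen


# Which of a few example parameters stay trainable when 4 layers are frozen.
example_names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.3.output.dense.bias",
    "encoder.layer.4.output.dense.bias",
    "classifier.weight",
]
trainable = [name for name in example_names if not is_frozen(name, 4)]
```

In a PyTorch setup, a rule like this would typically be applied by iterating over `model.named_parameters()` and setting `requires_grad` to `False` for each frozen parameter before constructing the optimizer.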