
Are ELECTRA's Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Ivan Rep, David Dukić, Jan Šnajder

TL;DR

ELECTRA's generator model proves surprisingly effective: it performs on par with BERT while using significantly fewer parameters and a substantially smaller embedding size. Combining TMFT with word similarity or domain-adaptive pre-training yields further boosts.

Abstract

While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA's sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator's last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over $8$ points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA's generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.

Paper Structure (16 sections, 13 figures, 2 tables)

Figures (13)

  • Figure 1: A method for improving sentence embeddings with (2) TMFT on STS. We apply mean pooling over the embeddings at layer $l$ and fine-tune. Optionally, one of two steps can be applied beforehand for further improvement: (1a) TMFT on word similarity, or (1b) DAPT using MLM.
  • Figure 2: Test set Spearman correlation coefficients on STSB using TMFT with and without improvements (shaded area is the standard deviation). Subfigure \ref{fig:comparison-plot} presents results using TMFT on STSB, \ref{fig:comparison-plot-wordsim} shows TMFT on STSB with prior TMFT on word similarity, and \ref{fig:comparison-plot-mlm} depicts TMFT on STSB with prior MLM. More details are in Table \ref{tab:model-comparison}.
  • Figure 3: The result of applying CKA on the hidden layer representations of the STSB test set at a layer with a certain index. Subfigure \ref{fig:cka_g_bert} presents the comparison between ELECTRA generator and BERT, subfigure \ref{fig:cka_d_bert} the comparison between ELECTRA discriminator and BERT, and subfigure \ref{fig:cka_deberta_bert} the comparison between DeBERTaV3 discriminator and BERT.
  • Figure 4: Comparison of the number of model parameters and the test set Spearman correlation coefficients. Each model shown is the one with the highest validation Spearman correlation coefficient. The figure also includes last layer representations that do not correspond to the highest validation Spearman correlation coefficient. The ELECTRA$_{\text{large}}$ discriminator is excluded because its value is too low ($25.84$). The gray line indicates the Pareto front. For detailed test set Spearman correlation coefficient values, refer to Table \ref{tab:pareto-table} in Appendix \ref{appendix:D}.
  • Figure 5: Test set F1 scores on the MRPC dataset across all layers.
  • ...and 8 more figures
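
The mean pooling step named in the Figure 1 caption and the Spearman correlation reported throughout can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, the token vectors are assumed to have already been taken from layer $l$ of a truncated model, and scoring sentence pairs by cosine similarity is an assumption made here for illustration, since the text above does not specify how pooled embeddings are compared.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average one sentence's token vectors, skipping masked (padding) positions."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In use, each sentence of a pair would be pooled with `mean_pool`, the pair scored with `cosine`, and the list of predicted scores compared against the gold similarity labels with `spearman`. The "points" quoted in the abstract presumably correspond to this coefficient multiplied by 100.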