Are ELECTRA's Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Ivan Rep; David Dukić; Jan Šnajder

TL;DR

ELECTRA's generator model is surprisingly effective: it performs on par with BERT while using significantly fewer parameters and a substantially smaller embedding size. Combining truncated model fine-tuning (TMFT) with word similarity or domain-adaptive pre-training yields further boosts.

Abstract

While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA's sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator's last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over $8$ points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA's generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.
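To make the idea concrete, here is a minimal sketch of the evaluation pipeline the abstract implies: take the hidden states at some layer $l$ of a truncated model, mean-pool them into one vector per sentence, score sentence pairs, and compare the scores to gold labels with the Spearman correlation coefficient. The function names, the random stand-in activations, and the use of cosine similarity as the pair score are assumptions made for illustration; they are not taken from the paper, and the fine-tuning step itself is omitted.

```python
import numpy as np


def mean_pool(hidden_states, attention_mask):
    """Average token embeddings per sentence, ignoring padding positions.

    hidden_states: (batch, seq_len, dim) activations taken at layer l.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a_norm * b_norm).sum(axis=1)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]


# Hypothetical stand-in for per-layer model outputs: index 0 is the embedding
# layer, indices 1..num_layers are the transformer blocks.
rng = np.random.default_rng(0)
num_layers, batch, seq_len, dim = 12, 4, 6, 8
all_layers = rng.normal(size=(num_layers + 1, batch, seq_len, dim))
mask = np.ones((batch, seq_len), dtype=int)
mask[:, 4:] = 0  # pretend the last two positions are padding

l = 3  # truncation layer: everything above layer l is discarded
sentence_embeddings = mean_pool(all_layers[l], mask)
print(sentence_embeddings.shape)  # (4, 8)
```

In practice the per-layer activations would come from the encoder itself, and the layer index would be chosen on a validation set rather than fixed in advance.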

Paper Structure (16 sections, 13 figures, 2 tables)

Introduction
Related Work
Truncated Model Fine-Tuning
Experiments and Results
ELECTRA
Further Improvements
Parameter-Performance Trade-off
Performance Drop Analysis
Conclusion
Limitations
Performance on Additional Tasks
Performance on STSB in Different Languages
Performance for Various Model Sizes on STSB
Overview of Best-Performing Models Using TMFT on STSB
TMFT on Randomly Initialized Models
...and 1 more section

Figures (13)

Figure 1: A method for improving sentence embeddings with (2) TMFT on STS. We apply mean pooling over the embeddings at layer $l$ and fine-tune. One of the combinations can also be added for further improvement: (1a) TMFT on word similarity, or (1b) DAPT using MLM.
Figure 2: Test set Spearman correlation coefficients on STSB using TMFT with and without improvements (shaded area is the standard deviation). Subfigure \ref{fig:comparison-plot} presents results using TMFT on STSB, \ref{fig:comparison-plot-wordsim} shows TMFT on STSB with prior TMFT on word similarity, and \ref{fig:comparison-plot-mlm} depicts TMFT on STSB with prior MLM. More details are in Table \ref{tab:model-comparison}.
Figure 3: The result of applying CKA on the hidden layer representations of the STSB test set at a layer with a certain index. Subfigure \ref{fig:cka_g_bert} presents the comparison between ELECTRA generator and BERT, subfigure \ref{fig:cka_d_bert} the comparison between ELECTRA discriminator and BERT, and subfigure \ref{fig:cka_deberta_bert} the comparison between DeBERTaV3 discriminator and BERT.
Figure 4: Comparison of the number of parameters of the model and the test set Spearman correlation coefficients. The shown models have the highest validation Spearman correlation coefficient value. The figure also includes the last layer representations that do not correspond to the highest validation Spearman correlation coefficient. ELECTRA$_{\text{large}}$ discriminator is excluded as its value is too small ($25.84$). The gray line indicates the Pareto front. For detailed test set Spearman correlation coefficient values, refer to Table \ref{tab:pareto-table} in Appendix \ref{appendix:D}.
Figure 5: Test set F1 scores on the MRPC dataset across all layers.
...and 8 more figures
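
The Figure 3 caption compares layer representations with CKA (centered kernel alignment). As a rough illustration, the sketch below computes the linear variant of CKA between two sets of representations of the same inputs. Whether the paper uses the linear or a kernel variant is not stated in this excerpt, so the linear form, the function name, and the random stand-in data are assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x: (n_examples, dim_x) and y: (n_examples, dim_y), rows aligned so that
    row i of both matrices corresponds to the same input sentence.
    """
    # Center each feature so the score does not depend on feature means.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical stand-ins for pooled layer outputs of two different models.
rng = np.random.default_rng(0)
reps_a = rng.normal(size=(50, 8))   # e.g. a model with embedding size 8
reps_b = rng.normal(size=(50, 5))   # e.g. a model with embedding size 5
score = linear_cka(reps_a, reps_b)  # the two sizes do not need to match
```

Because the score is invariant to orthogonal transformations and isotropic scaling of either input, it can compare layers from models with different embedding sizes, which is the setting the caption describes.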
