Contrastive Learning in Distilled Models

Valerie Lim; Kai Wen Ng; Kenneth Lim

TL;DR

The paper addresses two limitations of BERT: weak semantic textual similarity (STS) embeddings and a model size too large for edge deployment. It applies SimCSE-style contrastive learning to DistilBERT, yielding a model called DistilFACE, trained without supervision on Wiki 1M and evaluated on the STS datasets with Spearman correlation. It also explores efficiency enhancements such as automatic mixed precision (AMP) and quantization. DistilFACE achieves an average Spearman correlation of 72.1 on STS tasks, a 34.2% improvement over BERT base, with a smaller model than BERT; the paper also reports detailed hyperparameter and pooling results. The work shows that contrastive learning is compatible with distilled architectures, enabling edge-friendly semantic representations for retrieval and ranking, with practical implications for privacy-preserving and low-latency NLP systems.
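The SimCSE-style objective mentioned above can be sketched as an in-batch contrastive loss. This is a minimal illustration in plain Python, not the authors' implementation: the function names are hypothetical, the embeddings are toy vectors standing in for two dropout-perturbed encodings of the same sentences, and the temperature of 0.05 is an assumed value (the paper tunes similarity temperature as a hyperparameter).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(view_a, view_b, temperature=0.05):
    """Mean contrastive loss over a batch (hypothetical helper).

    view_a[i] and view_b[i] are two embeddings of sentence i (the positive
    pair); every view_b[j] with j != i serves as a negative for view_a[i].
    The temperature default is an assumption, not a value from the paper.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the positive pair as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each row of view_b is close to the matching row of view_a.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = in_batch_contrastive_loss(view_a, view_b)
# Swapping the rows makes every "positive" point at the wrong sentence.
shuffled_loss = in_batch_contrastive_loss(view_a, view_b[::-1])
```

In this toy batch the aligned pairing gives a loss near zero, while the shuffled pairing is penalized heavily, which is the behavior the objective relies on to pull positives together and push in-batch negatives apart.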

Abstract

Natural Language Processing models like BERT can provide state-of-the-art word embeddings for downstream NLP tasks. However, these models have yet to perform well on Semantic Textual Similarity (STS), and may be too large to be deployed as lightweight edge applications. We seek to apply a suitable contrastive learning method, based on the SimCSE paper, to a model architecture adapted from a knowledge-distillation-based model, DistilBERT, to address these two issues. Our final lightweight model, DistilFACE, achieves an average of 72.1 in Spearman's correlation on STS tasks, a 34.2 percent improvement over BERT base.
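To make the success metric concrete, the following is a small sketch of Spearman's rank correlation between predicted similarities and gold STS scores, written in plain Python. The helper names and the toy scores are illustrative assumptions, not data from the paper; the reported 72.1 is presumably this coefficient scaled by 100 and averaged over the STS tasks.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))


# Hypothetical cosine similarities from a model and hypothetical gold scores.
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [1.0, 3.0, 2.0, 5.0]
rho = spearman(predicted, gold)
```

Because the metric depends only on rank order, the model's cosine similarities do not need to be on the same scale as the gold annotations; they only need to order the sentence pairs the same way.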

Paper Structure (24 sections, 1 equation, 6 figures, 10 tables)

Introduction
Problem Statement
Who Cares? The Implications
How Is It Done Today and Limits
Contrastive Learning Models
Knowledge Distillation Models
Combining Contrastive Learning with Knowledge Distillation
Pooling Methods
Approach
Training Dataset: Wiki 1M
Evaluation Dataset: STS Task Datasets
Methodology
Success Metrics
Further Enhancements
Results & Analysis
...and 9 more sections

Figures (6)

Figure 1: Overall DistilFACE architecture. The similarity of the final embeddings is measured after the pooling layer. Solid lines are positive examples, while dotted lines are negative examples.
Figure 2: Spearman Corr. by Steps on STS datasets
Figure 3: Spearman Corr. by Learning Rate on STS datasets
Figure 4: Spearman Corr. by Similarity Temperature on STS datasets
Figure 5: Spearman Corr. by Batch Size on STS datasets
...and 1 more figure
