Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

Beliz Gunel, Jingfei Du, Alexis Conneau, Ves Stoyanov

TL;DR

The paper addresses the instability and limited generalization of cross-entropy fine-tuning for pre-trained language models by adding a supervised contrastive learning (SCL) objective. The objective combines cross-entropy with a contrastive term computed on $l_2$-normalized CLS representations and scaled by a temperature parameter. It improves over the cross-entropy baseline on several GLUE tasks in few-shot settings, is more robust to noisy training data, and transfers better to related tasks, all without extra data, memory banks, or data augmentations. The accompanying analysis shows tighter intra-class clustering of the learned embeddings.

Abstract

State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. However, the cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propose a supervised contrastive learning (SCL) objective for the fine-tuning stage. Combined with cross-entropy, our proposed SCL loss obtains significant improvements over a strong RoBERTa-Large baseline on multiple datasets of the GLUE benchmark in few-shot learning settings, without requiring specialized architecture, data augmentations, memory banks, or additional unsupervised data. Our proposed fine-tuning objective leads to models that are more robust to different levels of noise in the fine-tuning training data, and can generalize better to related tasks with limited labeled data.
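
The objective described in the TL;DR and abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the mixing weight `lam`, the default values of `lam` and `tau`, and the choice to sum the contrastive term over anchors are all assumptions made here for illustration. What it takes from the text is only the overall shape: a cross-entropy term combined with a contrastive term computed on $l_2$-normalized CLS embeddings and scaled by a temperature.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def scl_loss(embeddings, labels, tau=0.3):
    """Supervised contrastive term over one batch of CLS embeddings.

    `tau` is the temperature; 0.3 is a placeholder, not a value from the text.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        # Positives: other examples in the batch that share the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled dot products with every other example in the batch.
        sims = {
            k: sum(a * b for a, b in zip(z[i], z[k])) / tau
            for k in range(n)
            if k != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a positive for anchor i.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
    return total  # summed over anchors (an assumption of this sketch)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over the batch."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(labels)


def combined_loss(logits, embeddings, labels, lam=0.5, tau=0.3):
    """Weighted sum of the CE and SCL terms; `lam` is a hypothetical weight."""
    ce = cross_entropy(logits, labels)
    scl = scl_loss(embeddings, labels, tau)
    return (1 - lam) * ce + lam * scl
```

In this sketch, embeddings that already cluster by class give a smaller `scl_loss` than the same embeddings with labels that cut across the clusters, which matches the intuition of pulling same-class examples together and pushing different classes apart. Because of the normalization step, rescaling every embedding by a constant leaves the contrastive term unchanged.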

Paper Structure

This paper contains 13 sections, 2 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Our proposed objective includes a cross-entropy term (CE) and a supervised contrastive learning (SCL) term, and it is formulated to pull examples from the same class closer together and push examples from different classes farther apart. We show examples from the SST-2 sentiment analysis dataset from the GLUE benchmark, where class A (shown in red) is negative movie reviews and class B (shown in blue) is positive movie reviews. Although we show a binary classification case for simplicity, the loss is generally applicable to any multi-class classification setting.
  • Figure 2: t-SNE plots of the learned CLS embeddings on the SST-2 test set in the few-shot learning setting with 20 labeled examples to fine-tune on, comparing RoBERTa-Large fine-tuned with CE only (left) and with our proposed objective CE+SCL (right). Blue: positive examples; red: negative examples.
  • Figure 3: t-SNE plots of the learned CLS embeddings on the SST-2 test set when fine-tuning on 20 labeled examples, 100 labeled examples, and the full dataset, respectively, comparing CE with and without the SCL term. Blue: positive examples; red: negative examples.