BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings

Xianming Li; Jing Li

TL;DR

The backward dependency enhanced large language model (BeLLM) learns sentence embeddings by transforming specific attention layers from uni- to bi-directional, and achieves state-of-the-art performance in varying scenarios.

Abstract

Sentence embeddings are crucial for measuring semantic similarity. Most recent studies employ large language models (LLMs) to learn sentence embeddings. Existing LLMs mainly adopt an autoregressive architecture without explicit backward dependency modeling. We therefore examine the effects of backward dependencies in LLMs on semantic similarity measurement. Concretely, we propose a novel model: the backward dependency enhanced large language model (BeLLM). It learns sentence embeddings by transforming specific attention layers from uni- to bi-directional. We experiment extensively across various semantic textual similarity (STS) tasks and downstream applications. BeLLM achieves state-of-the-art performance in varying scenarios. The results show that autoregressive LLMs benefit from backward dependencies for sentence embeddings.
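The idea of turning specific attention layers from uni- to bi-directional can be illustrated with a minimal sketch of per-layer attention masks. This is not the authors' implementation: the function names, the `turning_point` parameter (taken here to mean the number of leading layers that stay uni-directional), and the boolean-list mask representation are all assumptions made for illustration.

```python
def causal_mask(seq_len):
    """Uni-directional mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def full_mask(seq_len):
    """Bi-directional mask: every position may attend to every position."""
    return [[True] * seq_len for _ in range(seq_len)]


def build_layer_masks(num_layers, turning_point, seq_len):
    """Return one attention mask per layer.

    Assumption for this sketch: layers with index < turning_point keep the
    causal mask, and all later layers have the causal mask removed.
    """
    if not 0 <= turning_point <= num_layers:
        raise ValueError("turning_point must lie in [0, num_layers]")
    return [
        causal_mask(seq_len) if layer < turning_point else full_mask(seq_len)
        for layer in range(num_layers)
    ]


# Hypothetical toy configuration: 4 layers, only the last one bi-directional.
masks = build_layer_masks(num_layers=4, turning_point=3, seq_len=3)
first_token_early = masks[0][0]  # what token 0 can see in a causal layer
first_token_late = masks[3][0]   # what token 0 can see after mask removal
```

In the causal layers the first token sees only itself, while in the converted layer it can also attend to the tokens that follow it, which is the backward dependency the abstract refers to.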

Paper Structure (23 sections, 8 equations, 5 figures, 6 tables)

Introduction
Quantitative Pilot Analysis
BeLLM
Degradation Experiment
Model Architecture
Training Methods
Experimental Setup
Datasets.
Evaluation Metrics.
Baselines and Comparisons.
Model Settings.
Experimental Results
Main Comparison Results
Standard STS.
Conditional STS.
...and 8 more sections

Figures (5)

Figure 1: Two sample sentences $A$ and $B$ from the STS-B dataset, shown in dashed boxes. Without backward dependency modeling (in grey), LLaMA predicted a similarity of $0.8$ for $A$ and $B$. The ground-truth similarity is $0.5$ because the play settings differ (snow versus shore).
Figure 2: Box plot of the sentence-level Spearman correlation on the STS-B test set. The average sentence-level Spearman correlations for LLaMA, ChatGLM, and BERT are about $0.17$, $0.15$, and $0.35$, respectively.
Figure 3: The overall framework of BeLLM. It includes three steps: 1) It first examines how to balance uni- and bi-directional layers with the degradation experiment and finds a turning point. 2) It transforms the attention layers after the turning point from uni- to bi-directional by removing the causal mask. 3) It employs contrastive learning to learn sentence embeddings. Here, we visualize the dependencies of the representative word "play." The LLM captures only the forward dependencies of "play," whereas BeLLM captures both forward and backward dependencies.
Figure 4: Degradation results on the Standard STS benchmark. X-axis: the number of uni-directional layers. Y-axis: the average Spearman's correlation computed with SentEval (Conneau and Kiela, 2018). The down arrow indicates a dramatic performance drop.
Figure 5: The sentence-level Spearman correlation box plot of LLaMA and BeLLM on the STS-B test set.
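The Figure 3 caption names contrastive learning as the third step but this page does not state the objective. As a rough illustration only, the sketch below computes a generic in-batch InfoNCE loss over cosine similarities. The choice of InfoNCE, the function names, and the `temperature` value are assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss (assumed objective, for illustration).

    positives[i] is the positive for anchors[i]; every other entry of
    positives serves as an in-batch negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: matched pairs should give a much lower loss than swapped ones.
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own positive and grows when an in-batch negative is closer, which is the pressure that pulls embeddings of semantically similar sentences together.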
