Retrieval Backward Attention without Additional Training: Enhance Embeddings of Large Language Models via Repetition

Yifei Duan, Raphael Shang, Deng Liang, Yongqiang Cai

TL;DR

The paper aims to improve embedding quality for decoder-only large language models in zero-shot settings without retraining. It proposes ReBA, a context augmentation method that combines text repetition with backward attention: it constructs a global symmetric attention matrix $A_{\text{new}}$ across all layers and heads, then computes updated token embeddings from the repeated sequence. Experiments on Chinese datasets show that ReBA significantly enhances sentence and word embeddings over classical and simple-repetition baselines, with word-level gains depending more heavily on backward attention. The approach benefits zero-shot semantic tasks at the price of additional computational overhead, and the paper suggests future work to reduce this cost via subsequence-based processing while maintaining embedding quality.
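To make the idea of backward attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single toy attention head, omits the text repetition step and the aggregation over layers and heads, and uses a hypothetical symmetrization rule (mirroring the causal lower triangle into the upper triangle, then renormalizing rows). The function names `causal_attention`, `symmetrize`, and `backward_attention_embeddings` are illustrative only.

```python
import numpy as np


def causal_attention(hidden):
    """Toy single-head causal attention weights; each row sums to 1."""
    n, d = hidden.shape
    scores = hidden @ hidden.T / np.sqrt(d)
    # Causal mask: token i may only attend to tokens j <= i.
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def symmetrize(attn):
    """Assumed rule: copy the lower triangle into the upper triangle."""
    return attn + attn.T - np.diag(np.diag(attn))


def backward_attention_embeddings(hidden, attn):
    """Re-weight token embeddings with the symmetrized attention matrix."""
    sym = symmetrize(attn)
    # Renormalize rows so each new embedding is a weighted average.
    weights = sym / sym.sum(axis=1, keepdims=True)
    return weights @ hidden


rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings
attn = causal_attention(hidden)
sym = symmetrize(attn)
new_hidden = backward_attention_embeddings(hidden, attn)
```

Under causal attention the first token sees only itself. After symmetrization it also receives weight from every later token, which is the intuition behind letting earlier tokens use subsequent context.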

Abstract

Language models can be viewed as functions that embed text into Euclidean space, where the quality of the embedding vectors directly determines model performance. Training such neural networks involves various uncertainties. This paper focuses on improving the performance of pre-trained language models in zero-shot settings through a simple and easily implementable method. We propose a novel backward attention mechanism to enhance contextual information encoding. Evaluated on the Chinese Massive Text Embedding Benchmark (C-MTEB), our approach achieves significant improvements across multiple tasks, providing valuable insights for advancing zero-shot learning capabilities.

Paper Structure

This paper contains 34 sections, 15 equations, 5 figures, 6 tables, 3 algorithms.

Figures (5)

  • Figure 1: Illustration of ReBA embedding: The classical embedding method captures only the contextual information preceding the token. In contrast, ReBA enhances the quality of target token embeddings by repeating the text $k-1$ times, computing a weighted sum of the target token's original embedding and subsequent token embeddings using backward attention weights. When the sentence appears $k$ times, the resulting embeddings are referred to as ReBA-$k$ embedding.
  • Figure 2: Illustration of attention relationships in BERT (bidirectional attention) and GPT (causal attention), with the corresponding attention matrix representations.
  • Figure 3: Performance on SLPWC and WSD tasks using Euclidean distances to evaluate word embeddings. The results show that ReBA encoding significantly enhances model performance on polysemous word understanding tasks. While performance fluctuates with the number of repetitions, increasing the repetition count does not necessarily lead to significant improvements. Based on this experiment, we observe that simple sentence repetition is not effective for improving word-level embeddings and only contributes to sentence-level understanding. Furthermore, the backward attention mechanism remains crucial for achieving further performance enhancements.
  • Figure 4: Information about C-MTEB, with most text lengths within 1000 tokens.
  • Figure 5: Performance on SLPWC and WSD tasks using Euclidean and cosine distances to evaluate word embeddings. The results still hold under both distance measures.