Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention

Zhaoxin Feng, Jianfei Ma, Emmanuele Chersoni, Xiaojing Zhao, Xiaoyi Bao

TL;DR

This paper investigates whether enabling bidirectional attention in decoder-only LLMs enhances word meaning representations for lexical semantic probing. By evaluating variants of Llama with Bi+MNTP and contrastive learning (unsupervised and supervised), the authors find that bidirectional attention alone does not consistently improve embeddings and may impair left-context utilization, while contrastive learning mitigates these issues and often achieves parity or superiority over BERT baselines. Across five probing tasks, autoregressive LLMs with appropriate bidirectional training and contrastive objectives perform competitively on both classification and regression semantic measures, and anisotropy analyses reveal nuanced effects of training strategy on embedding space geometry. The results highlight the potential of combining bidirectional context and contrastive learning to adapt autoregressive LLMs for word embeddings, while underscoring limitations related to anisotropy and language-scale generalizability.

Abstract

Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their adoption for text embedding tasks, along with the analysis of their semantic representations in probing tasks, has been relatively slow due to the constraints of the unidirectional attention mechanism. This paper explores whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.
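
To make the constraint concrete, the following is a minimal sketch of how a unidirectional (causal) attention mask differs from a bidirectional one. It is an illustration only, not the authors' implementation: the helper names `attention_mask` and `visible_context` and the example sentence are assumptions introduced here, and no real model or library is involved.

```python
def attention_mask(seq_len, bidirectional):
    """Build a seq_len x seq_len mask (hypothetical helper).

    mask[i][j] is True when position i may attend to position j.
    Causal: only j <= i is visible. Bidirectional: every position is visible.
    """
    return [[bidirectional or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def visible_context(tokens, index, bidirectional):
    """Return the tokens that the word at `index` can draw on (hypothetical helper)."""
    mask = attention_mask(len(tokens), bidirectional)
    return [tok for j, tok in enumerate(tokens) if mask[index][j]]


# Illustrative five-word sentence; the probed word is "dog" (index 1).
tokens = ["the", "dog", "chased", "the", "cat"]
causal = visible_context(tokens, 1, bidirectional=False)
full = visible_context(tokens, 1, bidirectional=True)

print(causal)  # left context and the word itself
print(full)    # the whole sentence
```

Under the causal mask, the representation of "dog" cannot reflect the verb or object that follow it, which is the limitation that enabling bidirectional attention is meant to address.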

Paper Structure

This paper contains 21 sections, 2 equations, 8 figures, 16 tables.

Figures (8)

  • Figure 1: Comparison of attention mechanisms in Llama and BERT models. (a) shows Llama's unidirectional attention, where prediction (orange arrows) and word representation (blue arrows) can access context on only one side; (b) shows BERT's bidirectional attention, where masked language modeling allows the word representation to access both preceding and subsequent context.
  • Figure 2: Summary of the five semantic probing tasks in our study: Tasks 1, 2, and 5 are classification tasks, with 0 and 1 denoting binary labels; Tasks 3 and 4 are regression tasks, where numbers (e.g., 5, 7) indicate continuous values for the target variables. The “probed word” (highlighted) refers to the word whose contextualized representation is extracted for the probing task.
  • Figure 3: Results of predicting subject animacy, verb causative/dynamic, and object animacy using each word in a sentence as the probed word, with scores extracted with Sheared-Llama-1.3B and its variants. The horizontal axis represents word indices in sentences (all with identical five-word syntactic structures).
  • Figure 4: Results of the subject animacy subtask in Task 1, comparing Sheared-Llama-1.3B with Llama2-7B.
  • Figure 5: Results of verb telicity and duration (Task 2).
  • ...and 3 more figures