Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention
Zhaoxin Feng, Jianfei Ma, Emmanuele Chersoni, Xiaojing Zhao, Xiaoyi Bao
TL;DR
This paper investigates whether enabling bidirectional attention in decoder-only LLMs enhances word meaning representations for lexical semantic probing. By evaluating variants of Llama trained with bidirectional attention plus masked next token prediction (Bi+MNTP) and with contrastive learning (unsupervised and supervised), the authors find that bidirectional attention alone does not consistently improve embeddings and may impair left-context utilization, while contrastive learning mitigates these issues and often matches or exceeds BERT baselines. Across five probing tasks, autoregressive LLMs with appropriate bidirectional training and contrastive objectives perform competitively on both classification and regression semantic measures, and anisotropy analyses reveal nuanced effects of training strategy on embedding space geometry. The results highlight the potential of combining bidirectional context and contrastive learning to adapt autoregressive LLMs for word embeddings, while underscoring limitations related to anisotropy and language-scale generalizability.
Abstract
Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their adoption for text embedding tasks has been relatively slow, as has the analysis of their semantic representations in probing tasks, owing to the constraints of the unidirectional attention mechanism. This paper explores whether these constraints can be overcome by enabling bidirectional attention in LLMs. We test different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning.
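To illustrate the constraint discussed above, the following minimal sketch contrasts unidirectional (causal) and bidirectional attention on a toy three-token sequence. It is not the authors' implementation: the function name `attention_weights`, the uniform scores, and the pure-Python softmax are assumptions chosen only for illustration.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix, optionally causally masked.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Under a causal mask, position i may only attend to positions j <= i.
        visible = [j for j in range(n) if (j <= i or not causal)]
        row_max = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - row_max) for j in visible}
        total = sum(exps.values())
        # Masked positions receive zero attention weight.
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Toy input: uniform scores for a 3-token sequence (illustrative values only).
scores = [[0.0] * 3 for _ in range(3)]
causal_w = attention_weights(scores, causal=True)
bidir_w = attention_weights(scores, causal=False)

print(causal_w[0])  # the first token sees only itself
print(bidir_w[0])   # the first token also sees the tokens to its right
```

With the causal mask, the representation of the first token cannot draw on any right-hand context, whereas removing the mask lets every token attend to the whole sequence. In practice the model also needs further training to make use of the newly visible context, which is what the additional training steps described in the abstract provide.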
