Exploring RWKV for Sentence Embeddings: Layer-wise Analysis and Baseline Comparison for Semantic Similarity

Xinghan Pan

TL;DR

This study evaluates RWKV, a linear-attention language model, as a source of zero-shot sentence embeddings. It conducts a layer-wise analysis across RWKV layers 1, 3, 5, 7, 9, and 11 and benchmarks them against a 50-dimensional GloVe baseline on the MRPC paraphrase task using Spearman correlation, while also profiling inference time and GPU memory. Results show that RWKV embeddings capture some semantic relatedness but underperform the GloVe baseline: performance decreases as layer depth increases, and inference latency remains substantially higher than the baseline's. The work highlights the need for task-specific fine-tuning, improved pooling strategies, and empirical validation of RWKV’s theoretical efficiency to realize practical benefits in semantic representation tasks.

Abstract

This paper investigates the efficacy of RWKV, a novel language model architecture known for its linear attention mechanism, for generating sentence embeddings in a zero-shot setting. I conduct a layer-wise analysis to evaluate the semantic similarity captured by embeddings from different hidden layers of a pre-trained RWKV model. The performance is assessed on the Microsoft Research Paraphrase Corpus (MRPC) dataset using Spearman correlation and compared against a GloVe-based baseline. My results indicate that while RWKV embeddings capture some semantic relatedness, they underperform compared to the GloVe baseline in terms of Spearman correlation. I also analyze the inference time and GPU memory usage, highlighting the computational trade-offs associated with RWKV embeddings. The findings suggest that while RWKV offers potential advantages in terms of linear scaling, its zero-shot sentence embedding quality for semantic similarity tasks requires further investigation and potential task-specific fine-tuning to match or exceed simpler baselines.
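
The evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes each sentence embedding is a mean over token vectors and that a pair is scored by cosine similarity, neither of which is confirmed in this section. All function names are hypothetical, and the Spearman correlation is written out in plain Python so the sketch needs no external packages.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one sentence vector (pooling choice is an assumption)."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate_layer(pairs_token_vectors, labels):
    """Score each sentence pair by cosine similarity and correlate with the labels.

    pairs_token_vectors: list of (tokens_a, tokens_b), each a list of token vectors
    taken from one hidden layer (or from GloVe lookups for the baseline).
    labels: MRPC paraphrase labels (1 = paraphrase, 0 = not).
    """
    scores = [
        cosine_similarity(mean_pool(tokens_a), mean_pool(tokens_b))
        for tokens_a, tokens_b in pairs_token_vectors
    ]
    return spearman(scores, labels)
```

Running `evaluate_layer` once per hidden layer, and once with GloVe token vectors, would give one correlation per configuration to compare. Because MRPC labels are binary, the label ranks are heavily tied, which is why the rank function averages ties instead of breaking them arbitrarily.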

Paper Structure

This paper contains 23 sections, 10 equations, 3 tables.