Layer-Aware Embedding Fusion for LLMs in Text Classifications

Jiho Gwak, Yuchul Jung

TL;DR

This paper tackles how to effectively use embeddings from decoder-based LLMs for text classification by introducing layer-aware embedding selection and multi-model fusion without fine-tuning. It systematically evaluates which layers carry the most discriminative information and demonstrates that optimal layers vary across datasets, with mid-to-late layers often outperforming the final layer. The study further shows that fusing embeddings from multiple models can yield gains when the models are complementary, though memory and compute costs rise with more models and layers. Overall, the work provides practical guidelines for designing scalable, task- and dataset-aware embedding fusion systems in real-world NLP tasks.

Abstract

Embedding fusion has emerged as an effective approach for enhancing performance across various NLP tasks. However, systematic guidelines for selecting optimal layers and developing effective fusion strategies for the integration of LLMs remain underexplored. In this study, we propose a layer-aware embedding selection method and investigate how to quantitatively evaluate different layers to identify the most important ones for downstream NLP tasks, showing that the critical layers vary depending on the dataset. We also explore how combining embeddings from multiple LLMs, without requiring model fine-tuning, can improve performance. Experiments on four English text classification datasets (SST-2, MR, R8, and R52) demonstrate that different layers in LLMs exhibit varying degrees of representational strength for classification, and that combining embeddings from different models can enhance performance if the models exhibit complementary characteristics. Additionally, we discuss resource overhead (memory and inference time) to provide a balanced perspective on the real-world feasibility of embedding fusion. Future work will explore multilingual and domain-specific datasets, as well as techniques for automating layer selection, to improve both performance and scalability.
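To make the idea of quantitatively evaluating layers concrete, the following is a minimal sketch rather than the authors' procedure. It assumes that per-layer embeddings have already been extracted from a frozen LLM, and it scores each layer with a simple nearest-centroid classifier on a held-out split. The scoring classifier and the function names `nearest_centroid_accuracy` and `select_best_layer` are illustrative assumptions; the paper may use a different probe.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Vector = List[float]


def nearest_centroid_accuracy(
    train_x: Sequence[Vector],
    train_y: Sequence[Hashable],
    val_x: Sequence[Vector],
    val_y: Sequence[Hashable],
) -> float:
    """Score one layer: fit class centroids on train, report accuracy on val.

    The nearest-centroid probe is an assumption chosen for brevity; any
    lightweight classifier could be substituted.
    """
    sums: Dict[Hashable, Vector] = {}
    counts: Dict[Hashable, int] = {}
    for x, y in zip(train_x, train_y):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x: Vector) -> Hashable:
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    correct = sum(1 for x, y in zip(val_x, val_y) if predict(x) == y)
    return correct / len(val_y)


def select_best_layer(
    layer_train: Dict[int, Sequence[Vector]],
    train_y: Sequence[Hashable],
    layer_val: Dict[int, Sequence[Vector]],
    val_y: Sequence[Hashable],
) -> Tuple[int, Dict[int, float]]:
    """Return the layer index with the highest validation accuracy, plus all scores."""
    scores = {
        layer: nearest_centroid_accuracy(
            layer_train[layer], train_y, layer_val[layer], val_y
        )
        for layer in layer_train
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical): layer 0 carries no class signal, layer 1 separates the classes.
toy_train = {
    0: [[0.5], [0.5], [0.5], [0.5]],
    1: [[0.0], [0.1], [1.0], [0.9]],
}
toy_val = {0: [[0.5], [0.5]], 1: [[0.05], [0.95]]}
toy_best, toy_scores = select_best_layer(toy_train, [0, 0, 1, 1], toy_val, [0, 1])
```

Because the selection is made per dataset, rerunning `select_best_layer` on a different dataset can return a different layer, which is consistent with the abstract's observation that the critical layers vary by dataset.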

Paper Structure

This paper contains 27 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The embeddings extracted from each LLM are mapped to a unified dimension using a linear projection, after which various fusion techniques are applied. The selected layer $n$ of each generation model is the most informative layer for classification. The two embeddings, projected to dimension $d_M$ ($M = 1024$), are then combined using a specific fusion strategy to generate a new representation. Finally, this fused embedding is fed into the classifier head, which is trained to optimize classification performance.
  • Figure 2: Comparison of Single-Layer Embedding Classification Performance in Decoder-Based LLMs
  • Figure 3: Comparison of Averaged Layer Embedding Fusion in Decoder-Based LLMs
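
The pipeline described in the Figure 1 caption (project each model's embedding to a shared dimension, then fuse) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the random matrices from `make_projection` stand in for projections that would be learned in practice, the strategy names `concat`, `sum`, and `mean` are examples of the "various fusion techniques" rather than a confirmed list, and the demo uses a small shared dimension where the caption appears to give 1024.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def make_projection(d_in: int, d_out: int, seed: int) -> Matrix:
    """Build a d_out x d_in matrix as a stand-in for a learned linear projection."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(weights: Matrix, x: Vector) -> Vector:
    """Apply the linear map: one output value per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def fuse(a: Vector, b: Vector, strategy: str = "concat") -> Vector:
    """Combine two embeddings with a named strategy (names are assumptions)."""
    if strategy == "concat":
        return a + b  # list concatenation: output length is len(a) + len(b)
    if len(a) != len(b):
        raise ValueError("element-wise fusion needs embeddings of equal length")
    if strategy == "sum":
        return [p + q for p, q in zip(a, b)]
    if strategy == "mean":
        return [(p + q) / 2.0 for p, q in zip(a, b)]
    raise ValueError(f"unknown fusion strategy: {strategy}")


# Hypothetical setup: two models with different hidden sizes, one shared dimension.
D_SHARED = 4  # small for the demo; the caption appears to use 1024
proj_a = make_projection(d_in=6, d_out=D_SHARED, seed=0)
proj_b = make_projection(d_in=10, d_out=D_SHARED, seed=1)

emb_a = project(proj_a, [1.0] * 6)    # embedding from model A's selected layer
emb_b = project(proj_b, [0.5] * 10)   # embedding from model B's selected layer
fused = fuse(emb_a, emb_b, strategy="concat")  # would be passed to the classifier head
```

Projecting first is what makes the element-wise strategies possible, since models with different hidden sizes cannot otherwise be summed or averaged. Concatenation keeps both views intact but doubles the input width of the classifier head, which is one source of the memory overhead the paper discusses.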