HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning

Manish Bhattarai, Ryan Barron, Maksim Eren, Minh Vu, Vesselin Grantcharov, Ismael Boureima, Valentin Stanev, Cynthia Matuszek, Vladimir Valtchinov, Kim Rasmussen, Boian Alexandrov

TL;DR

HEAL tackles hallucinations in retrieval-augmented generation by aligning domain-specific embeddings with hierarchical content, using Hierarchical Non-negative Matrix Factorization (HNMFk) to obtain multi-level cluster labels. It introduces a hierarchical multilevel contrastive loss, the Hierarchical Embedding Alignment Loss (HEAL), that computes level-wise losses with depth-dependent penalties and optimizes the overall loss $\mathcal{L}_{\text{HEAL}} = \frac{1}{N} \sum_{l=0}^{L-1} \lambda_l \sum_{i=1}^{N} \mathcal{L}_{i,l}$, where $\lambda_l$ is the depth-dependent penalty weight at level $l$ and $\mathcal{L}_{i,l}$ is the contrastive loss of sample $i$ at that level. The embedding model is fine-tuned with HEAL on domain-specific corpora, augmented with LLM-generated Q&A data, to jointly align documents and queries in the embedding space. This improves retrieval and classification and reduces hallucinations across Healthcare, Materials Science, Applied Mathematics, and Cybersecurity. Experimental results show consistent improvements in hierarchical metrics, retrieval precision, and hallucination reduction, including near-perfect Materials Science retrieval and substantial gains in the other domains, validating HEAL as a scalable, domain-adaptive enhancement for RAG systems.
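To make the aggregate loss above concrete, here is a minimal NumPy sketch. The summary does not define the per-level term $\mathcal{L}_{i,l}$ or how $\lambda_l$ is chosen, so the supervised-contrastive form of `level_contrastive_loss`, the `temperature` parameter, and the hand-picked `level_weights` are all assumptions made for illustration, not the authors' implementation. Only the outer structure (a weighted sum over levels, averaged over $N$ samples) follows the formula.

```python
import numpy as np

def level_contrastive_loss(emb, labels, temperature=0.1):
    """Per-sample contrastive loss at one hierarchy level.

    ASSUMPTION: a supervised-contrastive form; the source does not
    specify L_{i,l}. Returns an array of shape (N,).
    """
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    sim = z @ z.T / temperature                           # scaled cosine similarity
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Log-sum-exp over all *other* samples, computed stably.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    # Positives: other samples sharing this sample's label at this level.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    # Samples with no positive at this level contribute zero.
    return np.where(n_pos > 0, -pos_sum / np.maximum(n_pos, 1), 0.0)

def heal_loss(emb, hier_labels, level_weights, temperature=0.1):
    """L_HEAL = (1/N) * sum_l lambda_l * sum_i L_{i,l}.

    hier_labels has shape (N, L): one cluster label per sample per level.
    level_weights holds lambda_0 .. lambda_{L-1} (hypothetical values here).
    """
    n = emb.shape[0]
    total = 0.0
    for level, lam in enumerate(level_weights):
        total += lam * level_contrastive_loss(emb, hier_labels[:, level], temperature).sum()
    return total / n

# Toy example: four 2-D embeddings and a two-level label hierarchy.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
consistent = np.array([[0, 0], [0, 0], [1, 1], [1, 1]])    # labels match geometry
inconsistent = np.array([[0, 0], [1, 1], [0, 0], [1, 1]])  # labels cut across geometry
weights = [1.0, 0.5]
print(heal_loss(emb, consistent, weights), heal_loss(emb, inconsistent, weights))
```

In this toy setup the loss is lower when the hierarchical labels agree with the geometry of the embeddings, which is the property fine-tuning would push toward.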

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Materials Science, Cybersecurity, and Applied Mathematics.

Paper Structure

This paper contains 13 sections, 9 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of the HEAL-Based Embedding Model Alignment and Retrieval. The left side illustrates hierarchical label generation using HNMF, where the documents belonging to each cluster at the preceding depth are converted into TF-IDF matrices and further decomposed to extract sub-clusters. The t-SNE visualizations highlight cluster memberships in the document embeddings. The right side depicts fine-tuning of the SciNCL model with HEAL on the generated embeddings and HNMF-derived labels. Once trained, the aligned model computes a vector store from the corpus, enabling retrieval of the nearest $p$ documents for a given query embedding.
  • Figure 2: Embedding visualizations for different datasets, projected using t-SNE for dimensionality reduction. The density contours represent the kernel density estimation (KDE) of the embeddings in the 2D space, highlighting the clustering structure. Subplots show the Material dataset (a) before and (b) after model alignment, and the Healthcare dataset (c) before and (d) after model alignment. The contours illustrate the density distribution of embeddings, showcasing the effect of alignment on cluster compactness and separation.
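The retrieval step described in the Figure 1 caption (build a vector store from the corpus, then return the nearest $p$ documents for a query embedding) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the function names `build_vector_store` and `retrieve_top_p` are hypothetical, and cosine similarity is an assumed distance, since the caption only says "nearest".

```python
import numpy as np

def build_vector_store(doc_embeddings):
    """Unit-normalize document embeddings so a dot product equals cosine similarity."""
    norms = np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    return doc_embeddings / norms

def retrieve_top_p(store, query_embedding, p):
    """Return indices and scores of the p documents most similar to the query.

    ASSUMPTION: cosine similarity; the source does not name the metric.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = store @ q
    order = np.argsort(-scores)[:p]  # highest similarity first
    return order, scores[order]

# Toy corpus of three 2-D "document embeddings" and one query.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
store = build_vector_store(docs)
indices, scores = retrieve_top_p(store, np.array([1.0, 0.1]), p=2)
print(indices, scores)
```

In practice the embeddings would come from the aligned embedding model, and a dedicated vector index would replace the brute-force dot product for large corpora.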