Mind the Gap: A Generalized Approach for Cross-Modal Embedding Alignment

Arihan Yadav, Alan McMillan

Abstract

Retrieval-Augmented Generation (RAG) systems enhance text generation by incorporating external knowledge but often struggle when retrieving context across different text modalities due to semantic gaps. We introduce a generalized projection-based method, inspired by adapter modules in transfer learning, that efficiently bridges these gaps between various text types, such as programming code and pseudocode, or English and French sentences. Our approach emphasizes speed, accuracy, and data efficiency, requiring minimal resources for training and inference. By aligning embeddings from heterogeneous text modalities into a unified space through a lightweight projection network, our model significantly outperforms traditional retrieval methods like the Okapi BM25 algorithm and models like Dense Passage Retrieval (DPR), while approaching the accuracy of Sentence Transformers. Extensive evaluations demonstrate the effectiveness and generalizability of our method across different tasks, highlighting its potential for real-time, resource-constrained applications.

Paper Structure

This paper contains 24 sections, 3 equations, 3 figures, 6 tables, 2 algorithms.

Figures (3)

  • Figure 1: Visual representation of the comparison between the referenced models above and the process followed by our projection-based approach.
  • Figure 2: Architecture of the Projection Model aligning embeddings from Modality B (e.g., pseudocode) to Modality A (e.g., programming code) embedding space.
  • Figure 3: An example representation of positive and negative pairs in the embedding space. The loss is calculated from the distance between the anchor and the positive point and the distances between the anchor and the negative points.
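
The ideas in Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the single linear layer, the Euclidean distance, the hinge-style margin loss, the dimensions, and the names `project` and `margin_loss` are all assumptions chosen for illustration, since this listing does not specify the paper's architecture or loss function. The sketch projects a Modality B embedding (e.g., pseudocode) into the Modality A space (e.g., programming code) and scores it against one positive and several negatives.

```python
import numpy as np


def project(x, W, b):
    """Map Modality B embeddings into Modality A space.

    Assumption: a single linear layer stands in for the projection network.
    """
    return x @ W + b


def margin_loss(anchor, positive, negatives, margin=1.0):
    """Hinge-style loss over one anchor, one positive, and several negatives.

    Assumption: Euclidean distance and a fixed margin. The loss is zero once
    the positive is closer to the anchor than every negative by `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(negatives - anchor, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


rng = np.random.default_rng(0)
dim_b, dim_a = 8, 4  # hypothetical embedding sizes

# Randomly initialised projection parameters (these would be trained).
W = rng.normal(size=(dim_b, dim_a)) * 0.1
b = np.zeros(dim_a)

anchor_b = rng.normal(size=dim_b)          # e.g. a pseudocode embedding
positive_a = rng.normal(size=dim_a)        # matching code embedding
negatives_a = rng.normal(size=(5, dim_a))  # non-matching code embeddings

anchor_in_a = project(anchor_b, W, b)
loss = margin_loss(anchor_in_a, positive_a, negatives_a)
print(f"loss before training: {loss:.4f}")
```

Under these assumptions, training would adjust `W` and `b` to reduce the loss, pulling projected anchors toward their positives and away from negatives while the underlying embedding models stay unchanged.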