Description-Based Text Similarity

Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg

TL;DR

This work defines a concrete notion of similarity for information retrieval: description-based similarity, captured by the Abstract-Description Relation between an abstract query and a text that instantiates it. It trains a dual-encoder system on GPT-3-generated descriptions, aligning texts with their descriptive queries via a combined triplet and InfoNCE objective, and yields embeddings that better support retrieval of texts matching a description of their content. Evaluations on a Wikipedia-scale corpus show substantial improvements over strong baselines in human and automatic metrics, particularly in precision for top results and robustness to distractors. The approach demonstrates the value of task-specific data and losses for practical information-seeking retrieval and suggests directions for broader adoption of description-driven search strategies.
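
To make the "combined triplet and InfoNCE objective" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' training code: the squared-Euclidean triplet hinge, the dot-product InfoNCE logits, the unweighted sum of the two terms, and the `margin` and `temperature` values are all hypothetical choices. The anchor is assumed to be a text embedding, the positive a matching description, and the negatives non-matching descriptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge loss: the positive should be closer to the anchor than the
    # negative by at least `margin` (assumed squared Euclidean distance).
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    # Cross-entropy of picking the positive among positive + negatives,
    # using dot-product similarity scaled by an assumed temperature.
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

def combined_loss(anchor, positive, negatives, margin=1.0, temperature=0.1):
    # Assumed combination: mean triplet loss over negatives plus InfoNCE.
    triplet = sum(triplet_loss(anchor, positive, n, margin) for n in negatives)
    triplet /= len(negatives)
    return triplet + info_nce_loss(anchor, positive, negatives, temperature)

# Toy 2-d embeddings: a well-aligned pair versus a misaligned pair.
good = combined_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = combined_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup the well-aligned pair yields a loss close to zero while the misaligned pair is penalized by both terms, which is the behavior a contrastive objective of this kind is meant to produce.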

Abstract

Identifying texts with a given semantics is central for many information-seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of description-based similarity. We demonstrate the inadequacy of current text embeddings and propose an alternative model that significantly improves retrieval when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced through prompting an LLM, demonstrating how data from LLMs can be used for creating new capabilities not immediately possible using the original model.
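
The "standard nearest neighbor search" mentioned above can be sketched as ranking corpus embeddings by their similarity to a query embedding. The snippet below is a minimal illustration that assumes cosine similarity and a brute-force scan; the function names and the toy vectors are hypothetical, and a real system would obtain the vectors from the trained encoders and typically use an approximate index.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def nearest_neighbors(query_vec, corpus_vecs, k=3):
    # Score every corpus vector against the query, then keep the
    # indices of the k highest-scoring entries.
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy corpus of 2-d "sentence embeddings" and a description query.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = nearest_neighbors([1.0, 0.0], corpus, k=2)
```

Here `top` holds the indices of the two corpus vectors closest in direction to the query.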

Paper Structure (32 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Top retrieval results from the Wikipedia Index. Ours: the model developed in this work. Existing: all-mpnet-base-v2, a strong sentence-similarity encoder.
  • Figure 2: Human evaluation results: the number of times a given number of sentences was chosen per query instance, for our model (abstract-sim), averaged over all 4 baseline evaluations, vs. the baselines.
  • Figure 3: Precision automatic evaluation results: precision@k curve for abstract-sim and the baselines. Vertical lines represent 1 standard deviation.
  • Figure 4: Recall automatic evaluation results: valid-recall@k (left, higher is better) and invalid-recall@k (right, lower is better) for abstract-sim and the baselines. Vertical lines represent 1 standard deviation.
  • Figure 5: Ablation results on the automatic evaluation.
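
The precision@k and recall@k metrics named in the captions of Figures 3 and 4 can be sketched as follows. This is a generic illustration, not the paper's evaluation script: it assumes that invalid-recall@k is simply recall computed against the set of invalid (distractor) items, and all identifiers and example ids are hypothetical.

```python
def precision_at_k(ranked_ids, valid_ids, k):
    # Fraction of the top-k retrieved items that are valid.
    top = ranked_ids[:k]
    return sum(1 for item in top if item in valid_ids) / k

def recall_at_k(ranked_ids, target_ids, k):
    # Fraction of the target set that appears in the top-k results.
    # Passing the valid set gives valid-recall@k (higher is better);
    # passing the distractor set gives invalid-recall@k (lower is better).
    top = ranked_ids[:k]
    return sum(1 for item in top if item in target_ids) / len(target_ids)

# Toy ranking containing three valid items and two distractors.
ranked = ["v1", "d1", "v2", "v3", "d2"]
valid = {"v1", "v2", "v3"}
invalid = {"d1", "d2"}

p_at_2 = precision_at_k(ranked, valid, 2)
valid_r_at_3 = recall_at_k(ranked, valid, 3)
invalid_r_at_3 = recall_at_k(ranked, invalid, 3)
```

With this ranking, one of the top two results is valid, two of the three valid items appear in the top three, and one of the two distractors appears in the top three.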

Theorems & Definitions (1)

  • Definition 1: The Abstract-Description Relation