Description-Based Text Similarity
Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg
TL;DR
This work defines a concrete notion of similarity for information retrieval: description-based similarity, captured by the Abstract-Description Relation between an abstract query and a text that instantiates it. It trains a dual-encoder system on GPT-3-generated descriptions, aligning texts with their descriptive queries via a combined triplet and InfoNCE objective, so that the resulting embeddings better support retrieving texts that instantiate a given description. Evaluations on a Wikipedia-scale corpus show substantial improvements over strong baselines on both human and automatic metrics, particularly in precision for top results and robustness to distractors. The approach demonstrates the value of task-specific data and losses for practical information-seeking retrieval and suggests directions for broader adoption of description-driven search strategies.
Abstract
Identifying texts with a given semantics is central to many information-seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is corpus-driven, and is inconsistent and sub-optimal for many use cases. What, then, is a good notion of similarity for effective retrieval of text? We identify the need to search for texts based on abstract descriptions of their content, and the corresponding notion of description-based similarity. We demonstrate the inadequacy of current text embeddings and propose an alternative model that performs significantly better when used in standard nearest neighbor search. The model is trained using positive and negative pairs sourced by prompting an LLM, demonstrating how data from LLMs can be used to create new capabilities not immediately possible using the original model.
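
To make the training objective and the retrieval step concrete, the following is a minimal NumPy sketch of a combined triplet and InfoNCE loss over dual-encoder outputs, followed by nearest neighbor search. It is an illustration under stated assumptions, not the authors' implementation: the function names, the margin and temperature values, the use of Euclidean distance in the triplet term, cosine similarity with in-batch negatives in the InfoNCE term, and the unweighted sum of the two terms are all hypothetical choices. The inputs are assumed to be precomputed embeddings, where row i of `pos_desc` is a valid description of text i and row i of `neg_desc` is a misleading one.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_loss(text, pos_desc, neg_desc, margin=1.0):
    # Hinge loss: each text should be closer to its valid description
    # than to its misleading one by at least `margin` (assumed value).
    d_pos = np.linalg.norm(text - pos_desc, axis=-1)
    d_neg = np.linalg.norm(text - neg_desc, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

def info_nce_loss(text, pos_desc, temperature=0.1):
    # In-batch negatives (an assumption): for text i, description i is the
    # positive and every other description in the batch is a negative.
    logits = l2_normalize(text) @ l2_normalize(pos_desc).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()

def combined_loss(text, pos_desc, neg_desc, margin=1.0, temperature=0.1):
    # Unweighted sum of the two terms (the weighting is an assumption).
    return (triplet_loss(text, pos_desc, neg_desc, margin)
            + info_nce_loss(text, pos_desc, temperature))

def retrieve(query_vec, corpus_vecs, k=3):
    # Standard nearest neighbor search by cosine similarity:
    # returns the indices of the k corpus vectors most similar to the query.
    scores = l2_normalize(corpus_vecs) @ l2_normalize(query_vec)
    return np.argsort(-scores)[:k]
```

In use, `combined_loss` would be minimized with respect to the two encoders' parameters in an automatic-differentiation framework; at search time, a description is embedded by the query encoder and passed to `retrieve` together with the matrix of text embeddings.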
