GEAR: A Simple GENERATE, EMBED, AVERAGE AND RANK Approach for Unsupervised Reverse Dictionary
Fatemah Almeman, Luis Espinosa-Anke
TL;DR
The paper tackles reverse dictionary by introducing GEAR, a simple, unsupervised pipeline that uses an LLM to generate candidate terms from a definition, embeds these candidates, averages their embeddings, and ranks dictionary entries via cosine similarity with a KNN search. This generate-embed-average-rank approach achieves state-of-the-art performance on Hill's dataset (two of three test splits) and shows robust generalization across a diverse set of dictionaries in 3D-EX, outperforming heavily tuned supervised baselines. Key findings include the superiority of simple averaging over pooling variants, the importance of prompt design when used with embeddings, and the strong contribution of dictionary-aware embeddings like Instructor. The work demonstrates that combining LLM generation with varied embeddings yields scalable, domain-robust reverse dictionary capabilities, with potential applications in accessibility, translation, and language learning, and suggests directions for multilingual extension and improved weighting strategies.
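The four steps named above (generate, embed, average, rank) can be illustrated with a minimal sketch. This is not the authors' implementation: `generate_candidates` is a hypothetical placeholder that returns a canned list where an LLM call would go, and `embed` is a toy hashed character-trigram encoder standing in for a real embedding model such as Instructor. Only the averaging and cosine-ranking logic follows the description in the summary.

```python
import zlib
import numpy as np

DIM = 4096  # assumed size of the toy embedding space


def generate_candidates(definition):
    """Hypothetical stand-in for the LLM generation step (canned output)."""
    return ["automobile", "car", "vehicle"]


def embed(text, dim=DIM):
    """Toy encoder (an assumption, not a real model): hashed character trigrams."""
    vec = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % dim] += 1.0  # crc32 is stable across runs
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def gear_rank(definition, vocabulary, k=3):
    """Generate candidates, embed them, average, then rank the vocabulary."""
    candidates = generate_candidates(definition)
    candidate_vecs = np.stack([embed(c) for c in candidates])

    # Average the candidate embeddings into a single query vector.
    query = candidate_vecs.mean(axis=0)
    query = query / np.linalg.norm(query)

    # Rows are unit-normalised, so the dot product is the cosine similarity.
    vocab_vecs = np.stack([embed(w) for w in vocabulary])
    scores = vocab_vecs @ query

    # Brute-force nearest-neighbour search: keep the k highest-scoring entries.
    top = np.argsort(-scores)[:k]
    return [(vocabulary[i], float(scores[i])) for i in top]


vocabulary = ["dog", "cat", "automobile", "bicycle"]
ranking = gear_rank("a road vehicle with four wheels and an engine", vocabulary)
```

With a real LLM and a real embedding model in place of the two stand-ins, `ranking` would hold the top-k dictionary entries for the input definition. For large dictionaries, the brute-force search would typically be replaced by an approximate nearest-neighbour index.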
Abstract
Reverse Dictionary (RD) is the task of obtaining the most relevant word or set of words given a textual description or dictionary definition. Effective RD methods have applications in accessibility, translation or writing support systems. Moreover, in NLP research we find RD to be used to benchmark text encoders at various granularities, as it often requires word, definition and sentence embeddings. In this paper, we propose a simple approach to RD that leverages LLMs in combination with embedding models. Despite its simplicity, this approach outperforms supervised baselines in well-studied RD datasets, while also showing less over-fitting. We also conduct a number of experiments on different dictionaries and analyze how different styles, registers and target audiences impact the quality of RD systems. We conclude that, on average, untuned embeddings alone fare well below an LLM-only baseline (although they are competitive in highly technical dictionaries), but are crucial for boosting performance in combined methods.
