BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search
Xianming Li, Julius Lipp, Aamir Shakir, Rui Huang, Jing Li
TL;DR
BM25 ignores the overall similarity between a query and a document and has no semantic understanding, which limits lexical retrieval. The paper introduces BMX, which combines entropy-weighted similarity, one-shot weighted query augmentation, and score normalization to blend lexical efficiency with semantic cues. Across BEIR, LoCo, BRIGHT, and multilingual benchmarks, BMX consistently beats BM25 and, in several long-context and real-world settings, rivals or surpasses embedding-based methods, all within a scalable framework implemented in Baguetter. The work shows that well-designed lexical-semantic hybrids can achieve strong IR performance without relying on massive PLMs, which offers practical benefits for real-world search systems and bridges classical and neural retrieval paradigms.
Abstract
BM25, a widely used lexical search algorithm, remains crucial in information retrieval despite the rise of pre-trained and large language models (PLMs/LLMs). However, it neglects query-document similarity and lacks semantic understanding, limiting its performance. We revisit BM25 and introduce BMX, a novel extension of BM25 incorporating entropy-weighted similarity and semantic enhancement techniques. Extensive experiments demonstrate that BMX consistently outperforms traditional BM25 and surpasses PLM/LLM-based dense retrieval in long-context and real-world retrieval benchmarks. This study bridges the gap between classical lexical search and modern semantic approaches, offering a promising direction for future information retrieval research. The reference implementation of BMX is available in Baguetter, which was created in the context of this work: https://github.com/mixedbread-ai/baguetter.
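
For orientation, the BM25 baseline that BMX extends can be sketched as follows. This is a minimal sketch of standard Okapi BM25 scoring with a Lucene-style smoothed IDF, not the BMX formula and not the Baguetter API. The function names, the whitespace tokenizer, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (always non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "lexical search with bm25",
]
scores = bm25_scores("bm25 search", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Each query term contributes independently here, and only when it occurs literally in the document. That is the limitation the abstract describes: the score has no term for how similar the query and document are as a whole, and no notion of meaning beyond exact token matches.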
