ColBERT's [MASK]-based Query Augmentation: Effects of Quadrupling the Query Input Length
Ben Giacalone, Richard Zanibbi
TL;DR
This work investigates ColBERTv2's [MASK]-based query augmentation and its sensitivity to query length. The authors ask two questions: whether [MASK] tokens primarily weight existing query terms, and whether effectiveness scales with the number of [MASK] tokens. To answer them, they run controlled remapping and length-extension experiments on MS MARCO, TREC 2019-2020, and TREC COVID, using the MaxSim scoring framework $S_{d,q} = \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^T$. They find that [MASK] embeddings tend to weight non-[MASK] terms rather than introduce new terms: remapping generally reduces performance, and even the best remapping stays below the baseline. The largest gains occur when queries are padded to roughly the training length of 32 tokens, after which performance plateaus, though extending to 128 tokens yields small yet sometimes significant improvements on COVID data. Overall, ColBERTv2 is robust to longer [MASK]-augmented queries, and the results reinforce the view of [MASK] padding as a term-weighting mechanism that improves ranking and remains stable beyond training conditions. These findings inform prompt design and query augmentation strategies for transformer-based IR models in practical search scenarios.
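To make the MaxSim formula concrete, here is a minimal sketch in plain Python. It is an illustration only, not the ColBERT implementation: the function names `dot` and `maxsim_score` are hypothetical, and embeddings are assumed to be plain lists of floats (one vector per query or document token).

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: Sequence[Vector], doc_embs: Sequence[Vector]) -> float:
    """Sum, over query token embeddings, of the best dot product with any document token.

    Mirrors S_{d,q} = sum_i max_j E_{q_i} . E_{d_j}^T.
    Hypothetical helper: names and input types are assumptions for illustration.
    """
    if not doc_embs:
        raise ValueError("doc_embs must contain at least one embedding")
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy example with two query tokens and two document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0]]
score = maxsim_score(query, document)  # 1.0 (first token) + 0.5 (second token)
```

Under this formulation, every query-side embedding contributes one term to the sum, including embeddings produced for [MASK] tokens, which is why the number of [MASK] tokens can affect the final score.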
Abstract
A unique aspect of ColBERT is its use of [MASK] tokens in queries to score documents (query augmentation). Prior work shows that [MASK] tokens weight non-[MASK] query terms, emphasizing certain tokens over others, rather than introducing whole new terms as initially proposed. We begin by demonstrating that a term weighting behavior previously reported for [MASK] tokens in ColBERTv1 holds for ColBERTv2. We then examine the effect of varying the number of [MASK] tokens from zero up to four times the query input length used in training, both for first-stage retrieval and for scoring candidates. We observe an initial decrease in performance with few [MASK]s, a large increase when enough [MASK]s are added to pad queries to an average length of 32, and a plateau in performance afterwards. Additionally, we compare baseline performance to performance when the query length is extended to 128 tokens, and find that differences are small (e.g., within 1% on various metrics) and generally statistically insignificant, indicating that performance does not collapse if ColBERT is presented with more [MASK] tokens than expected.
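The padding step described above can be sketched as follows. This is a simplified illustration, not ColBERT's tokenizer code: `pad_query_with_masks` is a hypothetical helper, the input is assumed to be an already tokenized query (with any special tokens in place), and truncating over-long queries to the target length is an assumption of this sketch.

```python
from typing import List, Sequence


def pad_query_with_masks(
    tokens: Sequence[str],
    target_len: int = 32,
    mask_token: str = "[MASK]",
) -> List[str]:
    """Pad a tokenized query with [MASK] tokens up to target_len.

    Hypothetical helper for illustration. Queries longer than target_len
    are truncated here, which is an assumption of this sketch.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    kept = list(tokens[:target_len])
    return kept + [mask_token] * (target_len - len(kept))


# A short query padded to the training length (32) and to four times that (128).
query_tokens = ["[CLS]", "[Q]", "what", "is", "colbert", "[SEP]"]
padded_32 = pad_query_with_masks(query_tokens, target_len=32)
padded_128 = pad_query_with_masks(query_tokens, target_len=128)
```

Changing `target_len` corresponds to the experimental variable in the abstract: lengths below 32 add few [MASK] tokens, 32 matches the training setting, and 128 quadruples it.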
