ColBERT's [MASK]-based Query Augmentation: Effects of Quadrupling the Query Input Length

Ben Giacalone; Richard Zanibbi

TL;DR

This work investigates ColBERTv2's [MASK]-based query augmentation and its sensitivity to query length. It asks whether [MASK] tokens primarily weight existing query terms and whether effectiveness scales with more [MASK] tokens. To answer these questions, the authors run controlled remapping and length-extension experiments on MS MARCO, TREC 2019-2020, and TREC COVID, using the MaxSim scoring framework $S_{d,q} = \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^T$. They find that [MASK] embeddings tend to weight non-[MASK] terms rather than introduce new terms: remapping generally reduces performance, and even the best remapping remains below the baseline. The largest gains occur near the trained length of 32 tokens, after which performance plateaus, though extending to 128 tokens yields small yet sometimes significant improvements on COVID data. Overall, ColBERTv2 is robust to longer [MASK]-augmented queries, and the results reinforce the role of [MASK] padding as a term-weighting mechanism that improves ranking and remains stable beyond training conditions. These findings inform query augmentation strategies for transformer-based IR models in practical search scenarios.
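The MaxSim formula above can be illustrated with a minimal sketch in plain Python. The function names (`dot`, `maxsim_score`) and the toy two-dimensional embeddings are assumptions made for illustration; this is not the ColBERT implementation, which operates on batched, normalized tensors.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: Sequence[Vector], doc_embs: Sequence[Vector]) -> float:
    """Sum, over query token embeddings, of the best dot product with any document token.

    Mirrors S_{d,q} = sum_i max_j E_{q_i} . E_{d_j}^T from the text.
    """
    if not doc_embs:
        raise ValueError("document must contain at least one token embedding")
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy example (hypothetical embeddings): two query tokens, two document tokens.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, document)  # best matches are 1.0 and 0.5
```

Because every query position contributes its own maximum, each [MASK] embedding added to the query contributes one more term to the sum, which is why the number of [MASK] tokens can affect the ranking.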

Abstract

A unique aspect of ColBERT is its use of [MASK] tokens in queries to score documents (query augmentation). Prior work shows [MASK] tokens weighting non-[MASK] query terms, emphasizing certain tokens over others, rather than introducing whole new terms as initially proposed. We begin by demonstrating that a term-weighting behavior previously reported for [MASK] tokens in ColBERTv1 holds for ColBERTv2. We then examine the effect of changing the number of [MASK] tokens from zero up to four times the query input length used in training, both for first-stage retrieval and for scoring candidates. We observe an initial decrease in performance with few [MASK]s, a large increase when enough [MASK]s are added to pad queries to an average length of 32, and then a plateau in performance. Additionally, we compare baseline performance to performance when the query length is extended to 128 tokens, and find that differences are small (e.g., within 1% on various metrics) and generally statistically insignificant, indicating that performance does not collapse if ColBERT is presented with more [MASK] tokens than expected.
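The padding step described in the abstract can be sketched as follows. This is a simplified illustration at the token-string level: the helper name `augment_query`, the token layout ([CLS], [Q], query tokens, [SEP], then [MASK] padding), and the truncation rule are assumptions, not the authors' code.

```python
from typing import List, Sequence


def augment_query(tokens: Sequence[str], total_len: int = 32) -> List[str]:
    """Pad a tokenized query with [MASK] tokens up to a fixed total length.

    Assumed layout (hypothetical): [CLS] [Q] <query tokens> [SEP] [MASK] ... [MASK]
    """
    if total_len < 3:
        raise ValueError("total_len must leave room for [CLS], [Q], and [SEP]")
    core = ["[CLS]", "[Q]"] + list(tokens) + ["[SEP]"]
    if len(core) > total_len:
        # Assumed truncation rule: keep the leading tokens and close with [SEP].
        core = core[: total_len - 1] + ["[SEP]"]
    return core + ["[MASK]"] * (total_len - len(core))


# Same query at the trained length (32) and at four times that length (128).
trained = augment_query(["what", "is", "calculus"], total_len=32)
extended = augment_query(["what", "is", "calculus"], total_len=128)
```

Varying `total_len` between the unpadded length and 128 corresponds to the range of [MASK] counts examined in the length-extension experiments.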

Paper Structure (9 sections, 1 equation, 3 figures, 2 tables)

Introduction
Related Work
Methodology
RQ1: Do [MASK] tokens primarily weight non-[MASK] tokens in a query when using ColBERTv2?
RQ2: Does effectiveness increase with the number of [MASK]s, up to four times the number ColBERT has been trained with?
Results
RQ1: Do [MASK] tokens primarily weight non-[MASK] tokens in a query when using ColBERTv2?
RQ2: Does effectiveness increase as the number of [MASK]s increases up to four times the number ColBERT has been trained with?
Conclusion

Figures (3)

Figure 1: Cosine similarity of embedded tokens to each non-[MASK] token for positions 0 through 64. A cyclical pattern attending to the most relevant terms in the query (e.g. "period", "calculus", "california", "austria") can be seen, both before and after 32 tokens (the length trained with).
Figure 2: Left column: Cosine distance after tokens are switched from "what is" to "is what" in ColBERTv1 vs. ColBERTv2. We see the same trend of [Q] and [MASK] tokens shifting the most, and an overall increase in shifting when "what is" is not required (right column). The contrast between [Q] and [MASK] versus other tokens is more apparent in ColBERTv2 than in ColBERTv1.
Figure 3: nDCG@10, MRR(rel$\geq$2)@10, nDCG@1000, and MAP(rel$\geq$2) as the number of [MASK] tokens increases from 0 to 96 on TREC 2019-2020. The red line marks the standard length of 32 total tokens. Significant differences from the baseline are indicated with a star (Bonferroni correction, $p<0.05$).
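The Bonferroni correction mentioned in the Figure 3 caption can be sketched as below. The underlying significance test is not stated in this section, so the p-values here are hypothetical placeholders, and the function name `bonferroni_significant` is an assumption for illustration.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which comparisons remain significant after Bonferroni correction.

    Each p-value is compared against alpha divided by the number of comparisons.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p < threshold for p in p_values]


# Hypothetical p-values for three query lengths compared against the baseline.
flags = bonferroni_significant([0.001, 0.02, 0.2], alpha=0.05)
```

With three comparisons the per-comparison threshold is 0.05 / 3, so a raw p-value of 0.02 would no longer count as significant.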
